RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces
Ehrhardt et al., 2020
Summary
- Presents RELATE, which learns to generate physically plausible scenes and videos of multiple interacting objects
- Combines an object-centric GAN with an explicit model of correlations between individual objects
- Learns a physically interpretable parameterization that generates realistic videos and supports physical scene editing
- Links: [ website ] [ pdf ]
Background
- GAN-based image generation produces realistic images, but its parameterizations are generally not interpretable
- Adding structure to the latent space gives partial physical interpretability
- E.g. BlockGAN, which incorporates concepts like position and orientation, but assumes objects are mutually independent
- RELATE leverages the architectural biases of BlockGAN to model correlations between latent object state variables
Methods
- Scene composition and rendering module (see the first sketch after this list):
- Starts by independently sampling random appearance parameters, $z_0, \ldots, z_K$, for the background and the $K$ objects in the scene
- Maps the appearance parameters for objects and background to feature tensors $\Psi \in \mathbb{R}^{H \times H \times C}$ via two separate learned decoders
- Each foreground object also has a corresponding pose parameter $\theta_k \in \mathbb{R}^2$ which represents a 2D translation
- Foreground objects and background are composed into an overall scene tensor via element-wise max pooling
- Final decoder network renders complete scene as an image
- Interaction module (see the second sketch after this list):
- Does not assume pose parameters are independent, unlike BlockGAN
- First sample a vector of $K$ i.i.d. poses
- Then pass this vector into a correction network, based on the Neural Physics Engine (NPE), that remaps the initial configuration to account for correlations between object locations and appearances
- Apply the same correction function to each object's pose parameter in parallel, which enforces symmetry over object ordering
- To make dynamic predictions, object velocities can be learned with NPE-style updates and used to advance the pose parameters over time
- The learning objective combines the GAN discriminator loss and a style loss on the generated images with the $\ell_2$ loss of a position regressor network that predicts object locations given a generated image
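A minimal sketch of the composition-and-rendering idea in PyTorch, assuming square feature grids, MLP decoders, and a translation-only spatial-transformer shift for placing objects; all module names and shapes here are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneComposer(nn.Module):
    """Compose per-object and background feature tensors into one scene."""

    def __init__(self, z_dim=64, feat_ch=64, grid=16):
        super().__init__()
        # Two separate learned decoders: one for the background code z_0,
        # one shared across all foreground codes z_1..z_K.
        self.bg_decoder = nn.Sequential(
            nn.Linear(z_dim, feat_ch * grid * grid), nn.ReLU())
        self.fg_decoder = nn.Sequential(
            nn.Linear(z_dim, feat_ch * grid * grid), nn.ReLU())
        self.feat_ch, self.grid = feat_ch, grid

    def shift(self, psi, theta):
        # Place each object's feature map at its 2D pose theta via an
        # affine (translation-only) spatial transformer.
        aff = torch.zeros(psi.size(0), 2, 3, device=psi.device)
        aff[:, 0, 0] = aff[:, 1, 1] = 1.0
        aff[:, :, 2] = -theta  # translation in normalized coordinates
        flow = F.affine_grid(aff, psi.shape, align_corners=False)
        return F.grid_sample(psi, flow, align_corners=False)

    def forward(self, z_bg, z_fg, theta):
        # z_bg: (N, z_dim); z_fg: (N, K, z_dim); theta: (N, K, 2)
        n, k, _ = z_fg.shape
        c, g = self.feat_ch, self.grid
        psi_bg = self.bg_decoder(z_bg).reshape(n, c, g, g)
        psi_fg = self.fg_decoder(z_fg.reshape(n * k, -1)).reshape(n * k, c, g, g)
        psi_fg = self.shift(psi_fg, theta.reshape(n * k, 2)).reshape(n, k, c, g, g)
        # Element-wise max pooling composes the foreground objects and
        # the background into a single scene tensor.
        scene = torch.max(psi_fg.max(dim=1).values, psi_bg)
        return scene  # a final decoder network would render this to an image
```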
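And a minimal sketch of the NPE-style pose correction, assuming pairwise effects summed over neighbours followed by a residual 2D offset; the pairing scheme and network sizes are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class PoseCorrector(nn.Module):
    """Remap i.i.d. initial poses to account for inter-object correlations."""

    def __init__(self, z_dim=64, hidden=128):
        super().__init__()
        # Encodes one (subject, neighbour) pair of appearance + pose.
        self.pair = nn.Sequential(
            nn.Linear(2 * (z_dim + 2), hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Maps the aggregated pair effects to a 2D pose residual.
        self.head = nn.Linear(hidden, 2)

    def forward(self, z, theta):
        # z: (N, K, z_dim) appearance codes; theta: (N, K, 2) i.i.d. poses.
        n, k, _ = z.shape
        s = torch.cat([z, theta], dim=-1)            # per-object state
        subj = s.unsqueeze(2).expand(n, k, k, -1)    # subject copies
        nbr = s.unsqueeze(1).expand(n, k, k, -1)     # neighbour copies
        # The same pair function is applied to every object in parallel,
        # so the correction is symmetric under object reordering.
        # (For brevity each object is also paired with itself; a mask
        # could exclude the self-pair.)
        eff = self.pair(torch.cat([subj, nbr], dim=-1))
        agg = eff.sum(dim=2)
        # Residual update: corrected pose = initial pose + learned offset.
        return theta + self.head(agg)
```

For dynamics, the same machinery can predict per-object velocities that are integrated over time, roughly $\theta_{t+1} = \theta_t + v_t$.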
Results
- Baselines:
- GENESIS: parameterises a spatial GMM over images which is decoded from a set of object-centric latent variables
- OCF: explicitly represents the 2D position and depth of each object, as well as an embedding of its segmentation mask and appearance
- Datasets:
- BallsInBowl: two balls in an elliptical bowl
- CLEVR: cluttered tabletops
- ShapeStacks: block stacking
- RealTraffic: a busy street intersection with one to six cars
- Metrics:
- Fréchet Inception Distance (FID): quantifies the similarity between the distributions of generated and real images (sketched after this list)
- Fréchet Video Distance (FVD): considers the distribution over videos to capture temporal coherence, in addition to per-frame quality
- Ablations:
- BlockGAN2D: removes spatial correlation module and position regression loss
- w/o residual: removes the residual connection in the pose correction network
- w/o pos. loss: removes the position regression loss
- Each component of RELATE yields improvement in FID on BallsInBowl
- RELATE beats the baselines on all datasets, although BlockGAN2D is close on CLEVR-5 and ShapeStacks
- Scene editing to change the position or appearance of objects works to an extent, and RELATE can generate images with more or fewer objects than seen in training
- RELATE for modeling dynamics does better than a time-shuffled baseline, but no strong baseline was tested
- Qualitative results are also hard to judge since the dynamics are fairly simple and limited to 2D translations
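For reference, a minimal sketch of the FID computation from Inception activations; the function name and inputs are assumptions, and SciPy's `sqrtm` supplies the matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(act_real, act_gen):
    # act_*: (num_samples, feat_dim) Inception activations for real
    # and generated images.
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_g = np.cov(act_gen, rowvar=False)
    # FID = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^{1/2})
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical-noise imaginary parts
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)
```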
Conclusion
- Cannot account for large changes in appearance, e.g. due to changes in perspective, since appearance parameters are fixed
- Object sizes and initial poses are artificially limited
- Although the pose parameter can be extended to 3D, it is unclear how well it will work in practice
- One possible concern could be the effect on the position regression loss, which was shown to be critical
- Provides a good starting point for learning dynamics from raw videos with structured, interpretable object-centric parameterization