Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures object and contextual information.
Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.
RELATE builds upon the interpretable, structured latent parameterization of BlockGAN by modeling the correlations between object parameters to generate realistic videos of dynamic scenes, using raw, unlabeled data.
Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category instance invariance.
Introduces a novel hierarchical representation of visual scenes, Physical Scene Graphs (PSGs), as well as a network for learning them from RGB movies, PSGNet, which outperforms other unsupervised methods in scene segmentation.