Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures object and contextual information.

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category instance invariance.

Compositional Video Prediction

Novel method for video prediction from a single frame that decomposes the scene into entities with location and appearance features and captures ambiguities with a global latent variable.

Embodied Multimodal Multitask Learning

Proposes a multitask model to jointly learn semantic goal navigation and embodied question answering.