Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures object and contextual information.
Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category instance invariance.
Novel method for video prediction from a single frame by decomposing the scene into entities with location and appearance features, capturing ambiguities with a global latent variable.
Proposes a multitask model to jointly learn semantic goal navigation and embodied question answering.