2020

Occlusion resistant learning of intuitive physics from videos

Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations for training.
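A minimal sketch (not the paper's code) of one interaction-network step of the kind the recurrent dynamics model builds on: pairwise relation effects between objects are computed, aggregated per receiver, and used to update each object's state. The class name `InteractionStep` and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """Hypothetical single step of interaction-network style dynamics:
    effect(receiver, sender) for every object pair, summed over senders,
    then an update of each object's state. Names/dims are assumptions."""

    def __init__(self, state_dim=4, effect_dim=8):
        super().__init__()
        self.relation = nn.Sequential(nn.Linear(2 * state_dim, effect_dim), nn.ReLU())
        self.update = nn.Linear(state_dim + effect_dim, state_dim)

    def forward(self, states):
        # states: (n_objects, state_dim)
        n = states.size(0)
        receivers = states.unsqueeze(1).expand(n, n, -1)  # row i: receiver i
        senders = states.unsqueeze(0).expand(n, n, -1)    # col j: sender j
        pair = torch.cat([receivers, senders], dim=-1)    # (n, n, 2*state_dim)
        effects = self.relation(pair).sum(dim=1)          # aggregate over senders
        return self.update(torch.cat([states, effects], dim=-1))
```

Unrolling such a step recurrently over time yields object trajectories even while objects are occluded, since the state is maintained per object rather than per pixel.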

RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces

RELATE builds upon the interpretable, structured latent parameterization of BlockGAN by modeling the correlations between object parameters to generate realistic videos of dynamic scenes, using raw, unlabeled data.

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category-instance invariance.

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

BYOL improves on SotA self-supervised methods by introducing a target network, a slow-moving exponential moving average of the online network, which removes the need for negative examples.
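A minimal sketch, under assumed toy dimensions, of the two mechanisms the summary names: the EMA target update and BYOL's negative-free loss (negative cosine similarity between normalized predictions and target projections). The function names `ema_update` and `byol_loss` are illustrative, not the authors' API.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy online network; the target starts as a copy and is never trained by gradients.
online = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    # theta_target <- tau * theta_target + (1 - tau) * theta_online
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(tau).add_(po, alpha=1 - tau)

def byol_loss(p_online, z_target):
    # 2 - 2 * cosine similarity; minimized when prediction matches target
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()
```

Because the loss only pulls positives together, the slowly moving target is what prevents the trivial collapsed solution, rather than repulsion from negatives.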

A Simple Framework for Contrastive Learning of Visual Representations

SimCLR, a simple contrastive self-supervised learning framework, combines data augmentation to form positive pairs, a nonlinear projection head, a normalized temperature-scaled cross-entropy (NT-Xent) loss, and large batch sizes to achieve SotA results in self-supervised, semi-supervised, and transfer learning settings.
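A minimal sketch of the NT-Xent objective described above (the function name `nt_xent` and the default temperature are assumptions, not the reference implementation): embeddings of two augmented views of the same image are positives, and all other images in the batch act as negatives.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy loss (sketch).
    z1[i] and z2[i] are projections of two views of image i; every other
    embedding in the 2N-sized batch serves as a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs
    # the positive for row i is row i+N, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Large batches matter because the denominator of this softmax grows with the number of in-batch negatives, sharpening the contrastive signal.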

Learning Physical Graph Representations from Visual Scenes

Introduces a novel hierarchical representation of visual scenes, Physical Scene Graphs (PSGs), as well as a network for learning them from RGB movies, PSGNet, which outperforms other unsupervised methods in scene segmentation.

Embodied Multimodal Multitask Learning

Proposes a multitask model that jointly learns semantic goal navigation and embodied question answering.