Artificial Intelligence

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

An analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category-instance invariance.
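
As a rough illustration (not the paper's exact measure, which ranks top-activated units), one generic way to probe invariance is to compare representations of an image before and after a transform; the toy encoder and `occlude` transform below are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

def invariance_score(encoder, images, transform):
    """Mean cosine similarity between representations of images and their
    transformed versions; values near 1 indicate invariance to the transform."""
    with torch.no_grad():
        z = encoder(images)
        z_t = encoder(transform(images))
    return F.cosine_similarity(z, z_t, dim=1).mean().item()

def occlude(x):
    """Simulate occlusion by zeroing out a central patch."""
    x = x.clone()
    x[..., 16:48, 16:48] = 0.0
    return x

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
images = torch.rand(16, 3, 64, 64)
print("occlusion invariance:", invariance_score(encoder, images, occlude))
```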

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

BYOL improves on SotA self-supervised methods by training an online network to predict a target network's representation of another augmented view; the target network, a slow-moving exponential average of the online network, removes the need for negative examples.
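
A minimal PyTorch sketch of the two ingredients, assuming a toy linear encoder in place of BYOL's ResNet-plus-MLP stack and omitting the symmetrized two-view loss:

```python
import copy
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online network's prediction
    and the target network's projection (both L2-normalized)."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(target, online, tau=0.996):
    """Target parameters track the online parameters via an exponential
    moving average instead of receiving gradients."""
    for t, o in zip(target.parameters(), online.parameters()):
        t.data.mul_(tau).add_(o.data, alpha=1 - tau)

online = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 256))
predictor = torch.nn.Linear(256, 256)
target = copy.deepcopy(online)          # initialized as a copy, updated by EMA only
for p in target.parameters():
    p.requires_grad_(False)

v1, v2 = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)  # two augmented views
loss = byol_loss(predictor(online(v1)), target(v2))
loss.backward()
ema_update(target, online)
```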

A Simple Framework for Contrastive Learning of Visual Representations

SimCLR, a simple unsupervised contrastive learning framework, uses composed data augmentations to form positive pairs, a nonlinear projection head, a normalized temperature-scaled cross-entropy loss (NT-Xent), and large batch sizes to achieve SotA results in self-supervised, semi-supervised, and transfer learning settings.
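
A minimal sketch of the NT-Xent loss, assuming `z1` and `z2` are projection-head outputs for two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over a batch of positive
    pairs (z1[i], z2[i]); all other samples in the batch act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive for index i is its counterpart in the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)      # projections of two views
print(nt_xent(z1, z2).item())
```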

Entity Abstraction in Visual Model-Based Reinforcement Learning

The object-centric perception, prediction, and planning (OP3) framework demonstrates strong generalization to novel configurations in block-stacking tasks by symmetrically processing entity representations extracted from raw visual observations.
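
A minimal sketch of the symmetry idea only, not OP3's interactive inference machinery: the same weights are applied to every entity slot and every pair of slots, so the update is permutation-equivariant and independent of the number of entities (the module below is a hypothetical stand-in):

```python
import torch

class SymmetricEntityStep(torch.nn.Module):
    """Applies shared weights to every entity slot and every pair of slots,
    making the update permutation-equivariant over entities."""
    def __init__(self, dim):
        super().__init__()
        self.pairwise = torch.nn.Linear(2 * dim, dim)
        self.update = torch.nn.Linear(2 * dim, dim)

    def forward(self, slots):                         # slots: (B, K, D)
        B, K, D = slots.shape
        # Pairwise interactions between entities, computed with shared weights.
        a = slots.unsqueeze(2).expand(B, K, K, D)
        b = slots.unsqueeze(1).expand(B, K, K, D)
        effects = torch.relu(self.pairwise(torch.cat([a, b], dim=-1))).sum(dim=2)
        return slots + self.update(torch.cat([slots, effects], dim=-1))

slots = torch.randn(4, 5, 32)                         # 5 entity slots of dim 32
step = SymmetricEntityStep(32)
print(step(slots).shape)                              # (4, 5, 32), works for any K
```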

Adversarial Examples that Fool both Computer Vision and Time-Limited Humans

Adversarial examples trained on an ensemble of CNNs with a retinal preprocessing layer reduce the accuracy of time-limited humans in a two-alternative forced-choice task.
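
For illustration, one standard way to craft perturbations that transfer across models is to attack an ensemble's averaged logits; this FGSM sketch is a generic stand-in, not the paper's optimization or retinal layer:

```python
import torch
import torch.nn.functional as F

def ensemble_fgsm(models, x, y, eps=0.03):
    """One-step FGSM against the averaged logits of an ensemble; attacking
    many models jointly encourages perturbations that transfer."""
    x = x.clone().requires_grad_(True)
    logits = torch.stack([m(x) for m in models]).mean(dim=0)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

models = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
          for _ in range(3)]
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_adv = ensemble_fgsm(models, x, y)
```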

Control What You Can: Intrinsically Motivated Task-Planning Agent

The Control What You Can (CWYC) method learns to control components of the environment in order to achieve multi-step goals, combining task planning with intrinsic motivation based on surprise and learning progress.
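
A minimal sketch of a learning-progress signal in the spirit of CWYC's intrinsic motivation (the selector below is a hypothetical simplification; the full method also models surprise and learns a task graph):

```python
import numpy as np

class LearningProgressSelector:
    """Picks the next task in proportion to absolute learning progress,
    i.e. the change in recent success rate per task."""
    def __init__(self, n_tasks, window=20):
        self.outcomes = [[] for _ in range(n_tasks)]
        self.window = window

    def record(self, task, success):
        self.outcomes[task].append(float(success))

    def progress(self, task):
        h = self.outcomes[task]
        if len(h) < 2 * self.window:
            return 1.0                        # optimistic: explore unseen tasks
        recent = np.mean(h[-self.window:])
        older = np.mean(h[-2 * self.window:-self.window])
        return abs(recent - older)            # learning-progress signal

    def select(self):
        p = np.array([self.progress(t) + 1e-6 for t in range(len(self.outcomes))])
        return int(np.random.choice(len(p), p=p / p.sum()))

sel = LearningProgressSelector(n_tasks=3)
sel.record(0, True)
print(sel.select())
```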

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

A large-scale, comprehensive study challenges common assumptions in the unsupervised learning of disentangled representations and motivates future work to demonstrate concrete benefits of disentanglement in robust experimental setups.

Learning Physical Graph Representations from Visual Scenes

Introduces Physical Scene Graphs (PSGs), a novel hierarchical representation of visual scenes, along with PSGNet, a network that learns PSGs from RGB movies and outperforms other unsupervised methods on scene segmentation.

Compositional Video Prediction

A novel method for video prediction from a single frame that decomposes the scene into entities with location and appearance features, capturing ambiguity in future outcomes with a global latent variable.
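
A minimal sketch of the decomposition idea, with per-entity latents advanced by shared weights and a single sampled global latent capturing ambiguity; the module is a hypothetical simplification that omits location and appearance decoding:

```python
import torch

class EntityPredictor(torch.nn.Module):
    """Predicts each entity's next latent with shared weights, conditioned on
    a global latent u that captures which of several futures unfolds."""
    def __init__(self, dim, u_dim=8):
        super().__init__()
        self.step = torch.nn.Linear(dim + u_dim, dim)
        self.u_dim = u_dim

    def forward(self, entities, u):                    # entities: (B, K, D)
        B, K, D = entities.shape
        u = u.unsqueeze(1).expand(B, K, self.u_dim)    # broadcast to all entities
        return entities + self.step(torch.cat([entities, u], dim=-1))

pred = EntityPredictor(dim=16)
entities = torch.randn(2, 4, 16)                       # 4 entities per frame
u = torch.randn(2, 8)                                  # one global latent per video
rollout = [entities]
for _ in range(5):                                     # autoregressive rollout
    rollout.append(pred(rollout[-1], u))
```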

Embodied Multimodal Multitask Learning

Proposes a multitask model that jointly learns semantic goal navigation and embodied question answering.