2020

Training data-efficient image transformers & distillation through attention

Produces a competitive convolution-free transformer by training only on ImageNet.

Scaling Laws for Neural Language Models

A large-scale empirical investigation of scaling laws shows that performance has a power-law relationship to model size, dataset size, and training compute, while architectural details have minimal effects.
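
To make the power-law claim concrete, here is a minimal sketch of how such a relationship can be recovered from measurements: a relation of the form y = a * x^b becomes a straight line in log-log space, so a linear fit yields the exponent. The function name `fit_power_law` and the data points are hypothetical and synthetic, chosen only for illustration; they are not values from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**b by least squares in log-log space; return (a, b)."""
    # log(y) = log(a) + b * log(x), so a degree-1 polynomial fit gives b and log(a).
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, made-up data (an assumption for illustration only):
# a loss that falls as a power of model size.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * sizes ** -0.1

a, b = fit_power_law(sizes, losses)
print(f"fitted prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

On real measurements the points would scatter around the line, and the fitted exponent would only be meaningful over the range of sizes actually observed.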

Self-supervised learning through the eyes of a child

Applies self-supervised learning algorithms to developmentally realistic, longitudinal, egocentric video from young children and demonstrates the emergence of high-level visual representations.

Are we done with ImageNet?

Proposes a new set of ImageNet labels that address the limitations of the original labels resulting from multiple objects in a single image and synonymous labels.

Contrastive Learning of Structured World Models

Contrastively-trained Structured World Models (C-SWMs) depart from traditional pixel-based reconstruction losses and use an energy-based hinge loss for learning object-centric world models.

VisualCOMET: Reasoning about the Dynamic Context of a Still Image

By training on a large-scale repository of Visual Commonsense Graphs, VisualCOMET, a single-stream vision-language transformer model, is able to generate inferences about past and present events by integrating information from the image with textual descriptions of the present event and location.

Forward Prediction for Physical Reasoning

Hierarchical Relational Inference

IntPhys 2019: A Benchmark for Visual Intuitive Physics Understanding

Learning Long-term Visual Dynamics with Region Proposal Interaction Networks