computer vision

Are we done with ImageNet?

Proposes a new set of ImageNet labels that addresses limitations of the originals, which break down when a single image contains multiple objects or when label classes are synonymous.

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

In line with Rich Sutton's 'The Bitter Lesson', video prediction performance keeps improving as model capacity increases, leaving open the question of how far the right combination of maximal model capacity and minimal inductive bias can go.

Contrastive Learning of Structured World Models

Contrastively-trained Structured World Models (C-SWMs) depart from traditional pixel-based reconstruction losses and use an energy-based hinge loss for learning object-centric world models.
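
A minimal PyTorch sketch of this kind of energy-based hinge loss, assuming object-factored embeddings and a translational transition model (names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def cswm_hinge_loss(z, z_next, z_neg, transition, action, gamma=1.0):
    """Contrastive hinge loss in the spirit of C-SWMs (sketch).

    z, z_next: (batch, num_objects, embed_dim) object-factored states.
    z_neg: negative states drawn from other samples in the batch.
    transition: hypothetical module predicting per-object state deltas.
    """
    # Positive energy: distance between predicted and observed next state.
    pred_next = z + transition(z, action)
    pos_energy = F.mse_loss(pred_next, z_next, reduction="none").sum(dim=(1, 2))

    # Negative energy: distance between a corrupted state and the next state.
    neg_energy = F.mse_loss(z_neg, z_next, reduction="none").sum(dim=(1, 2))

    # Hinge: pull positive pairs together, push negatives at least gamma apart.
    return (pos_energy + F.relu(gamma - neg_energy)).mean()
```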

VisualCOMET: Reasoning about the Dynamic Context of a Still Image

VisualCOMET, a single-stream vision-language transformer trained on a large-scale repository of Visual Commonsense Graphs, generates inferences about past and future events by integrating information from the image with textual descriptions of the present event and location.

Forward Prediction for Physical Reasoning

Demonstrates the potential of forward prediction for solving PHYRE physical reasoning tasks by investigating various combinations of object- and pixel-based forward-prediction and task-solution models.
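
A generic sketch of the forward-prediction plus task-solution pattern: roll a learned dynamics model forward from a candidate action and classify whether the rollout solves the task (encoder, dynamics, and classifier are hypothetical modules, not the paper's API):

```python
import torch

def score_action(encoder, dynamics, classifier, initial_obs, action, horizon=10):
    """Score a candidate PHYRE action via a latent forward rollout (sketch)."""
    state = encoder(initial_obs, action)  # embed the scene conditioned on the action
    states = [state]
    for _ in range(horizon):
        state = dynamics(state)           # predict the next latent state
        states.append(state)
    # The task-solution model consumes the predicted trajectory.
    return torch.sigmoid(classifier(torch.stack(states, dim=1)))
```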

Hierarchical Relational Inference

Hierarchical Relational Inference (HRI) learns hierarchical object representations and their relations directly from raw visual inputs, but is evaluated against limited baselines on simple datasets.

Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures both object and contextual information.
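
The per-object pooling step can be sketched with torchvision's RoIAlign (a minimal sketch; the stride and shapes are assumptions, not the paper's exact setup):

```python
import torch
from torchvision.ops import roi_align

def object_features(feature_map, boxes, output_size=7, stride=16):
    """Pool per-object features from backbone features, RPIN-style (sketch).

    feature_map: (batch, C, H, W) conv features of a frame.
    boxes: list of (num_objects, 4) proposal boxes in image coordinates.
    """
    # RoIAlign extracts a fixed-size patch per object proposal.
    rois = roi_align(feature_map, boxes, output_size=output_size,
                     spatial_scale=1.0 / stride, aligned=True)
    # Flatten each pooled patch into an object embedding for downstream reasoning.
    return rois.flatten(start_dim=1)  # (total_objects, C * output_size**2)
```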

Occlusion resistant learning of intuitive physics from videos

Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.
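
The dynamics core here is an interaction network; one step of the generic pattern (Battaglia et al., 2016) looks roughly like this (dimensions and aggregation choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """One step of a generic interaction network (sketch)."""

    def __init__(self, state_dim, effect_dim=64):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * state_dim, effect_dim), nn.ReLU(),
            nn.Linear(effect_dim, effect_dim))
        self.object = nn.Sequential(
            nn.Linear(state_dim + effect_dim, effect_dim), nn.ReLU(),
            nn.Linear(effect_dim, state_dim))

    def forward(self, states):  # states: (batch, num_objects, state_dim)
        n = states.size(1)
        senders = states.unsqueeze(2).expand(-1, -1, n, -1)
        receivers = states.unsqueeze(1).expand(-1, n, -1, -1)
        # Pairwise effects for every (sender, receiver) object pair.
        effects = self.relation(torch.cat([senders, receivers], dim=-1))
        incoming = effects.sum(dim=1)  # aggregate effects arriving at each object
        # Update each object from its own state plus aggregated effects.
        return states + self.object(torch.cat([states, incoming], dim=-1))
```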

RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces

RELATE builds upon the interpretable, structured latent parameterization of BlockGAN by modeling the correlations between object parameters to generate realistic videos of dynamic scenes, and is trained on raw, unlabeled data.

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category instance invariance.
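
A minimal sketch of the kind of aggressive cropping the analysis refers to, in the style of SimCLR/MoCo augmentation pipelines (parameters are illustrative):

```python
from torchvision import transforms

# Two small random crops of the same image form a positive pair, so on
# object-centric data the model mostly learns to match partial (occluded)
# views of a single object.
aggressive_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # down to 8% of the area
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def two_views(image):
    """Return a positive pair for a contrastive loss (sketch)."""
    return aggressive_crop(image), aggressive_crop(image)
```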