Demonstrates that scaling up self-supervised methods along data size, model capacity, and problem complexity enables them to match or surpass ImageNet supervised pre-training on a variety of tasks.
Demonstrates that providing explanations and model criticism can be useful tools for improving the reliability of ImageNet-trained CNNs for end-users.
Produces a competitive convolution-free transformer by training only on ImageNet.
Demonstrates the potential of forward-prediction for solving PHYRE physical reasoning tasks by investigating various combinations of object and pixel-based forward-prediction and task-solution models.
IntPhys provides a well-designed benchmark for evaluating a system's understanding of a few core concepts about the physics of objects.
Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.
Analysis of invariances in representations from contrastive self-supervised models reveals that they leverage aggressive cropping on object-centric datasets to improve occlusion invariance at the expense of viewpoint and category instance invariance.
Proposes a novel method for video prediction from a single frame that decomposes the scene into entities with location and appearance features, capturing ambiguities with a global latent variable.
Proposes a multitask model to jointly learn semantic goal navigation and embodied question answering.