paper review

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

In line with Rich Sutton's 'The Bitter Lesson', the finding that video prediction performance keeps improving with model capacity leaves open the question of how far we can get by combining maximal model capacity with minimal inductive bias.

Contrastive Learning of Structured World Models

Contrastively-trained Structured World Models (C-SWMs) depart from traditional pixel-based reconstruction losses and use an energy-based hinge loss for learning object-centric world models.
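The hinge loss trades pixel reconstruction for distances in latent space: the transition model's prediction is pulled toward the true next state, while randomly sampled negative states are pushed at least a margin away. A minimal sketch (function name, tensor shapes, and squared-Euclidean energy are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def cswm_hinge_loss(z_next, z_next_pred, z_neg, margin=1.0):
    """Contrastive hinge loss on latent object states (sketch).

    z_next:      encoded true states at t+1, shape (batch, dim)
    z_next_pred: transition-model predictions from t, shape (batch, dim)
    z_neg:       negative states sampled from other episodes, shape (batch, dim)
    """
    # Energy = squared Euclidean distance in the latent space
    pos_energy = ((z_next_pred - z_next) ** 2).sum(axis=1)
    neg_energy = ((z_neg - z_next) ** 2).sum(axis=1)
    # Minimize energy for true transitions; push negatives
    # until they are at least `margin` away.
    return (pos_energy + np.maximum(0.0, margin - neg_energy)).mean()
```

With a perfect prediction and distant negatives the loss is zero; negatives that collapse onto the true state contribute the full margin.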

VisualCOMET: Reasoning about the Dynamic Context of a Still Image

By training on a large-scale repository of Visual Commonsense Graphs, VisualCOMET, a single-stream vision-language transformer, generates inferences about past and future events as well as present intents by integrating the image with textual descriptions of the current event and location.

Forward Prediction for Physical Reasoning

Demonstrates the potential of forward-prediction for solving PHYRE physical reasoning tasks by investigating various combinations of object and pixel-based forward-prediction and task-solution models.

Attention Is All You Need

The Transformer, a sequence transduction model that replaces recurrent layers and relies entirely on attention mechanisms, achieves new SotA on machine translation tasks while reducing training time significantly.
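The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, applied in parallel across multiple heads. A minimal single-head sketch (numpy rather than a deep-learning framework, and without the multi-head projections or masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) similarity scores
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # convex combination of values
```

Each output row is a convex combination of the value rows, weighted by query-key similarity; the 1/sqrt(d_k) scaling keeps the dot products from saturating the softmax at large dimensions.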

Hierarchical Relational Inference

Hierarchical Relational Inference (HRI) learns hierarchical object representations and their relations directly from raw visual inputs, but is evaluated against limited baselines on simple datasets.

IntPhys 2019: A Benchmark for Visual Intuitive Physics Understanding

IntPhys provides a well-designed benchmark for evaluating a system's understanding of a few core concepts about the physics of objects.

Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures both object and contextual information.

Occlusion resistant learning of intuitive physics from videos

Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.

RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces

RELATE builds upon the interpretable, structured latent parameterization of BlockGAN by modeling the correlations between object parameters to generate realistic videos of dynamic scenes, using raw, unlabeled data.