Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Qi et al., 2020

Elias Z. Wang

Published Nov 10, 2020 Artificial Intelligence

Source: Qi et al., 2020

Summary

Aims to build object representations that can capture inter-object and object-environment interactions over long time horizons
Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space
Improves predictions on the billiards, PHYRE, and ShapeStacks datasets and can also be used to plan actions

Background

Predicting changing dynamics in a visual scene is a core component of physical common sense
Some approaches treat this as an image translation problem and predict pixels directly; without an object-centric representation, they perform poorly
Other methods that operate on videos using object-centric representations have limitations such as lacking contextual reasoning or needing a predetermined number of objects
ROI pooling is used to extract object representations from raw frames, while also incorporating information about the context (i.e. environment)
Instead of predicting pixels, RPIN is trained end-to-end by minimizing distance between predicted and ground-truth object trajectories
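To make the ROI pooling step mentioned above concrete, here is a minimal sketch of max-based ROI pooling on a single-channel feature map. The function name `roi_max_pool`, the box convention (integer cell coordinates with exclusive upper bounds), and the 2x2 output size are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

def roi_max_pool(feature_map: List[List[float]],
                 box: Tuple[int, int, int, int],
                 out_size: int = 2) -> List[List[float]]:
    """Max-pool the cells inside `box` into an out_size x out_size grid.

    Assumption: box = (x0, y0, x1, y1) in feature-map cells, x1 and y1
    exclusive, and the box spans at least out_size cells per side.
    """
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    pooled = []
    for by in range(out_size):
        # Row range covered by this output bin (at least one row).
        ys = y0 + (by * height) // out_size
        ye = max(ys + 1, y0 + ((by + 1) * height) // out_size)
        row = []
        for bx in range(out_size):
            # Column range covered by this output bin (at least one column).
            xs = x0 + (bx * width) // out_size
            xe = max(xs + 1, x0 + ((bx + 1) * width) // out_size)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# Toy 4x4 feature map with values 0..15 in row-major order.
toy_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
whole_map = roi_max_pool(toy_map, (0, 0, 4, 4))
inner_box = roi_max_pool(toy_map, (1, 1, 3, 3))
```

Because every box is pooled to the same fixed grid, each object gets a feature of the same shape regardless of its size, and since the feature map comes from a CNN over the whole frame, each cell can already carry information about the surrounding environment.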

Methods

RPIN takes $N$ video frames as input and predicts the object locations for $T$ future timesteps.
- First extract image features from each frame using an hourglass CNN, which provides global context information
- Apply FCN on top of ROI pooling to obtain object-centric visual features
- Apply FCN to object position to get position features
- Concatenated object visual and position features are fed into interaction module, which updates the object features based on inter-object interactions
- These new object features are fed into prediction module that predicts the next timestep object state representation
- A final one layer decoder estimates the spatial location for each object from the predicted object representation
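The interaction, prediction, and decoding steps above can be sketched as a rollout loop. This is a simplified illustration, not the authors' implementation: the learned networks are replaced by plain Python callables (`self_fn`, `pair_fn`, `decode` are hypothetical names), and the interaction and prediction modules are merged into a single step.

```python
from typing import Callable, List

Vector = List[float]

def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def interaction_step(features: List[Vector],
                     self_fn: Callable[[Vector], Vector],
                     pair_fn: Callable[[Vector, Vector], Vector]) -> List[Vector]:
    """Update each object from its own feature plus summed effects of all others."""
    updated = []
    for i, f_i in enumerate(features):
        effect = [0.0] * len(f_i)
        for j, f_j in enumerate(features):
            if i != j:
                effect = add(effect, pair_fn(f_i, f_j))
        updated.append(add(self_fn(f_i), effect))
    return updated

def rollout(features: List[Vector], steps: int,
            self_fn: Callable[[Vector], Vector],
            pair_fn: Callable[[Vector, Vector], Vector],
            decode: Callable[[Vector], float]) -> List[List[float]]:
    """Step the object features forward and decode one position per object per step."""
    trajectory = []
    for _ in range(steps):
        features = interaction_step(features, self_fn, pair_fn)
        trajectory.append([decode(f) for f in features])
    return trajectory

# Toy stand-ins (assumptions): a feature is [position, velocity].
def drift(f: Vector) -> Vector:
    return [f[0] + f[1], f[1]]      # move by the current velocity

def no_effect(f_i: Vector, f_j: Vector) -> Vector:
    return [0.0, 0.0]               # objects ignore each other in this toy

def read_position(f: Vector) -> float:
    return f[0]

toy_trajectory = rollout([[0.0, 1.0], [5.0, -1.0]], 2, drift, no_effect, read_position)
```

Summing pairwise effects keeps the update independent of how many objects are in the scene, which is the property that lets this style of model handle a variable number of objects.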
The training objective is a combination of $l_2$ distances between predicted and ground-truth positions and between predicted and ground-truth relative positions (offsets)
- Also has a discounting factor to mitigate poor predictions in early training
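A minimal sketch of such an objective for a single object is shown below. Several details are assumptions made for illustration: the offset is taken to be the displacement from the previous timestep, the discount is a geometric weight `gamma ** t`, and the names `trajectory_loss` and `offsets` are hypothetical. The exact weighting schedule is not specified in these notes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared l2 distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def offsets(track: List[Point], start: Point) -> List[Point]:
    """Per-step displacement; `start` is the last observed position."""
    previous = [start] + list(track[:-1])
    return [(p[0] - q[0], p[1] - q[1]) for p, q in zip(track, previous)]

def trajectory_loss(pred: List[Point], target: List[Point],
                    start: Point, gamma: float = 0.9) -> float:
    """Discounted sum of position and offset errors over the predicted horizon."""
    pred_off = offsets(pred, start)
    target_off = offsets(target, start)
    total = 0.0
    for t in range(len(target)):
        weight = gamma ** t  # assumed schedule: later timesteps count less
        total += weight * (sq_dist(pred[t], target[t])
                           + sq_dist(pred_off[t], target_off[t]))
    return total

# Toy example: the prediction is exact at step 0 and one unit short at step 1.
example_loss = trajectory_loss([(1.0, 0.0), (2.0, 0.0)],
                               [(1.0, 0.0), (3.0, 0.0)],
                               start=(0.0, 0.0), gamma=0.5)
```

Down-weighting later timesteps limits how much compounding rollout errors dominate the loss while the model's short-horizon predictions are still inaccurate.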

Results

Datasets:
- Simulation Billiards: three different colored balls with random velocity applied to one ball
- Real World Billiards: “Three-cushion Billiards” videos from YouTube, bounding boxes obtained by fine-tuning pretrained ResNet-101 FPN
- PHYRE: physical reasoning dataset, treat moving balls as objects and other static bodies as background
- ShapeStacks: synthetic dataset of stacked objects (cubes, cylinders, or balls)
Baselines:
- Visual Interaction Network (VIN): directly assigns different channels of image features to objects, requires specifying fixed number of objects
- Object Masking (OM): takes one image and $m$ object proposals, creating $m$ masked images, disregards any background information
- Compositional Video Prediction (CVP): object feature obtained by feeding cropped object image patch into encoder, limited contextual information
Metric:
- Squared error between predicted and ground-truth positions averaged across objects and timesteps
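As a small illustration of this metric, the sketch below averages the squared position error over every (timestep, object) pair. The nested-list layout `[timestep][object] -> (x, y)` and the function name are assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def mean_squared_position_error(pred: List[List[Point]],
                                target: List[List[Point]]) -> float:
    """Average squared error over all objects and timesteps."""
    total, count = 0.0, 0
    for pred_t, target_t in zip(pred, target):
        for p, g in zip(pred_t, target_t):
            total += sum((a - b) ** 2 for a, b in zip(p, g))
            count += 1
    return total / count

# Two timesteps, two objects; the second object is off by 1 and then by 2 in y.
toy_pred = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0)]]
toy_target = [[(0.0, 0.0), (1.0, 2.0)], [(2.0, 2.0), (3.0, 5.0)]]
toy_error = mean_squared_position_error(toy_pred, toy_target)
```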
Dynamics Prediction:
- OM does better than the other baselines on billiards and PHYRE because it explicitly models objects; all baselines work decently on ShapeStacks
- RPIN does best on all datasets, combining explicit object modeling and context feature learning
Generalizes better than baselines to the simulation billiards dataset with more balls or larger balls, to ShapeStacks with more blocks, and to new tasks in PHYRE
Dynamics prediction is poor when using only position features; adding local interaction constraints and position features to the visual features improves predictions
RPIN outperforms baselines for planning (billiard target state, billiard hitting, PHYRE), although VIN is close for billiard tasks.
- VIN cannot be applied to PHYRE, since there is a variable number of objects

Conclusion

Main innovation is the use of ROI pooling to learn object-centric visual features with contextual information
Requires bounding boxes as input for position labels and ROI proposals, limiting real world applicability
Unclear how much context is actually needed to solve these tasks
- PHYRE is artificially made to require more context by only considering moving balls as objects

computer vision paper review arXiv 2020 intuitive physics CMU UC Berkeley UCSD

