# Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Qi et al., 2020


## Summary

• Aims to build object representations that can capture inter-object and object-environment interactions over long time horizons
• Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space
• Improves predictions on the billiards, PHYRE, and ShapeStacks datasets, and can also be used for planning actions

## Background

• Predicting changing dynamics in a visual scene is a core component of physical common sense
• Some approaches treat this as an image translation problem and predict pixels directly; without an object-centric representation, they perform poorly
• Other methods that operate on videos using object-centric representations have limitations such as lacking contextual reasoning or needing a predetermined number of objects
• RPIN uses ROI pooling to extract object representations from raw frames while also incorporating information about the context (i.e. the environment)
• Instead of predicting pixels, RPIN is trained end-to-end by minimizing the distance between predicted and ground-truth object trajectories
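
To make the ROI pooling step concrete, here is a minimal sketch in plain Python, not the authors' implementation. It max-pools a single-channel feature map inside one box into a fixed-size grid. The function name `roi_max_pool`, the integer box convention `(x0, y0, x1, y1)` with exclusive upper bounds, and the choice of max pooling are all assumptions made for illustration; practical implementations usually handle many channels and fractional box coordinates.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]   # one channel, indexed [row][col]
Box = Tuple[int, int, int, int]  # hypothetical convention: (x0, y0, x1, y1), upper bounds exclusive


def roi_max_pool(feature_map: FeatureMap, box: Box, out_size: int = 2) -> FeatureMap:
    """Max-pool the cells of `feature_map` inside `box` into an out_size x out_size grid."""
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    if height < out_size or width < out_size:
        raise ValueError("box must span at least out_size cells along each side")

    pooled: FeatureMap = []
    for i in range(out_size):
        # Integer bin edges; every bin is non-empty because height >= out_size.
        r_start = y0 + (i * height) // out_size
        r_end = y0 + ((i + 1) * height) // out_size
        row: List[float] = []
        for j in range(out_size):
            c_start = x0 + (j * width) // out_size
            c_end = x0 + ((j + 1) * width) // out_size
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    # Toy 4x4 feature map whose value at (row, col) is row * 4 + col.
    fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    print(roi_max_pool(fmap, (0, 0, 4, 4)))  # [[5.0, 7.0], [13.0, 15.0]]
```

Each cell of a CNN feature map already summarizes a neighbourhood of the input frame, so pooling features rather than cropping pixels lets the object representation keep some of its surrounding context.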

## Methods

• RPIN takes $N$ video frames as input and predicts the object locations for $T$ future timesteps.
• First, an hourglass CNN extracts image features from each frame, so that the features carry global context information
• Apply an FCN on top of the ROI-pooled features to obtain object-centric visual features
• Apply an FCN to the object positions to get position features
• The concatenated visual and position features of each object are fed into an interaction module, which updates the object features based on inter-object interactions
• These updated object features are fed into a prediction module that predicts the object state representation at the next timestep
• A final one-layer decoder estimates the spatial location of each object from the predicted object representation
• The training objective combines the $l_2$ distance between predicted and ground-truth positions with the $l_2$ distance between predicted and ground-truth relative positions (offsets)
• The loss also includes a discount factor over timesteps to mitigate the effect of inaccurate long-horizon predictions early in training
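
The training objective described above can be sketched as follows. This is a hedged illustration in plain Python rather than the paper's exact loss: the geometric weighting `discount ** t`, the definition of an offset as the displacement from the previous timestep, the equal weighting of the position and offset terms, and the averaging over objects and timesteps are all assumptions. The names `discounted_trajectory_loss`, `offsets`, and `start` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Frame = Sequence[Point]        # positions of all objects at one timestep
Trajectory = Sequence[Frame]   # indexed [timestep][object]


def squared_l2(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def offsets(traj: Trajectory, start: Frame) -> List[List[Point]]:
    """Per-step displacement of each object; `start` holds the positions before step 0."""
    out: List[List[Point]] = []
    prev = start
    for frame in traj:
        out.append([(p[0] - q[0], p[1] - q[1]) for p, q in zip(frame, prev)])
        prev = frame
    return out


def discounted_trajectory_loss(pred: Trajectory, target: Trajectory,
                               start: Frame, discount: float = 0.9) -> float:
    """Discounted sum of squared position and offset errors (assumed form)."""
    if not target or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equally long")
    pred_off = offsets(pred, start)
    target_off = offsets(target, start)

    total = 0.0
    for t in range(len(target)):
        step = 0.0
        for i in range(len(target[t])):
            step += squared_l2(pred[t][i], target[t][i])          # position term
            step += squared_l2(pred_off[t][i], target_off[t][i])  # offset term
        # Later timesteps get geometrically smaller weight (assumption).
        total += (discount ** t) * step / len(target[t])
    return total / len(target)
```

With `discount < 1`, the same position error costs less at a late timestep than at an early one, which is the intended effect of discounting: long-horizon errors do not dominate the gradient.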

## Results

• Datasets:
  • Simulation Billiards: three differently colored balls, with a random velocity applied to one ball
  • Real World Billiards: “Three-cushion Billiards” videos from YouTube, with bounding boxes obtained by fine-tuning a pretrained ResNet-101 FPN
  • PHYRE: physical reasoning dataset; moving balls are treated as objects and all other static bodies as background
  • ShapeStacks: synthetic dataset of stacked objects (cubes, cylinders, or balls)
• Baselines:
  • Visual Interaction Network (VIN): directly assigns different channels of image features to objects, which requires specifying a fixed number of objects
  • Object Masking (OM): takes one image and $m$ object proposals and creates $m$ masked images, disregarding any background information
  • Compositional Video Prediction (CVP): object features are obtained by feeding a cropped object image patch into an encoder, which limits contextual information
• Metric:
  • Squared error between predicted and ground-truth positions, averaged across objects and timesteps
• Dynamics Prediction:
  • OM does better than the other baselines on billiards and PHYRE because it models objects explicitly; all baselines work decently on ShapeStacks
  • RPIN does best on all datasets, combining explicit object modeling with context feature learning
  • RPIN generalizes better than the baselines to simulation billiards with more balls or larger balls, ShapeStacks with more blocks, and new tasks in PHYRE
  • Dynamics prediction is poor when using only position features; adding local interaction constraints and position features to the visual features improves predictions
• RPIN outperforms the baselines for planning (billiard target state, billiard hitting, PHYRE), although VIN is close on the billiard tasks
• VIN cannot be applied to PHYRE, since PHYRE has a variable number of objects
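
The metric listed under Results can be written out explicitly. As a sketch, let $\hat{p}_i^t$ and $p_i^t$ be the predicted and ground-truth positions of object $i$ at future timestep $t$, let $K$ be the number of objects, and let $T$ be the number of predicted timesteps as in the Methods section. The symbols $\hat{p}$, $p$, and $K$ are notation introduced here for illustration, not taken from the paper.

$$
\text{error} = \frac{1}{T K} \sum_{t=1}^{T} \sum_{i=1}^{K} \left\lVert \hat{p}_i^t - p_i^t \right\rVert_2^2
$$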

## Conclusion

• Main innovation is the use of ROI pooling to learn object-centric visual features with contextual information
• Requires bounding boxes as input for position labels and ROI proposals, limiting real-world applicability
• Unclear how much context is actually needed to solve these tasks
• PHYRE is artificially made to require more context by considering only the moving balls as objects