Learning Long-term Visual Dynamics with Region Proposal Interaction Networks
Qi et al., 2020
Summary
- Aims to build object representations that can capture inter-object and object-environment interactions over long time horizons
- Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space
- Improves prediction accuracy on the billiards, PHYRE, and ShapeStacks datasets, and can also be used for planning actions
Background
- Predicting changing dynamics in a visual scene is a core component of physical common sense
- Some approaches treat this as an image translation problem and predict pixels directly; without an object-centric representation, they perform poorly
- Other methods that operate on videos using object-centric representations have limitations such as lacking contextual reasoning or needing a predetermined number of objects
- RPIN instead uses ROI pooling to extract object representations from raw frames while also incorporating information about the context (i.e., the environment)
- Instead of predicting pixels, RPIN is trained end-to-end by minimizing distance between predicted and ground-truth object trajectories
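As a rough illustration of the ROI pooling step mentioned above, here is a minimal sketch in plain Python. It assumes a single-channel feature map, integer box coordinates in feature-map space, and max pooling into a fixed grid; the function name `roi_max_pool` and its parameters are hypothetical and are not taken from the paper. Context reaches the object features because the feature map being pooled was computed over the whole frame, not because the pooling window extends beyond the box.

```python
import math


def roi_max_pool(feature_map, box, out_size=2):
    """Max-pool the region `box` of a 2D feature map into an out_size x out_size grid.

    feature_map: list of rows (list of lists of numbers), a single channel.
    box: (x0, y0, x1, y1) integer feature-map coordinates, end-exclusive.
    Assumption for this sketch: the box is at least 1 cell wide and tall.
    """
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Row range of bin i (floor for the start, ceil for the end, so no bin is empty).
        y_start = y0 + (i * height) // out_size
        y_end = y0 + math.ceil((i + 1) * height / out_size)
        row = []
        for j in range(out_size):
            x_start = x0 + (j * width) // out_size
            x_end = x0 + math.ceil((j + 1) * width / out_size)
            row.append(max(
                feature_map[y][x]
                for y in range(y_start, y_end)
                for x in range(x_start, x_end)
            ))
        pooled.append(row)
    return pooled


# A 4x4 feature map with values 0..15 in row-major order.
fm = [[4 * r + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(fm, (0, 0, 4, 4))   # pool the full map
inner = roi_max_pool(fm, (1, 1, 3, 3))   # pool a 2x2 box in the middle
```

Every object yields a feature of the same size regardless of its box size, which is what lets the number of objects vary from scene to scene.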
Methods
- RPIN takes $N$ video frames as input and predicts the object locations for $T$ future timesteps.
- First, extract image features from each frame with an hourglass CNN, which lets the features capture global context
- Apply FCN on top of ROI pooling to obtain object-centric visual features
- Apply FCN to object position to get position features
- Concatenated object visual and position features are fed into interaction module, which updates the object features based on inter-object interactions
- These new object features are fed into prediction module that predicts the next timestep object state representation
- A final one layer decoder estimates the spatial location for each object from the predicted object representation
- The training objective is a combination of the $l_2$ distance between predicted and ground-truth positions and between predicted and ground-truth relative positions (offsets)
- Also has a discounting factor to mitigate poor predictions in early training
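The training objective described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: positions are 2D points, an offset is taken to be the displacement from the previous timestep, and the discounting is modeled as a hypothetical fixed factor `discount` raised to the step index. How the factor is set or scheduled during training is not specified in these notes, so it is left to the caller.

```python
def trajectory_loss(pred, target, last_observed, discount=1.0):
    """Discounted squared-l2 loss on positions and offsets (illustrative sketch).

    pred, target: list over T timesteps of lists over K objects of (x, y).
    last_observed: list over K objects of (x, y) at the final input frame.
    discount: hypothetical factor; timestep t is weighted by discount ** t.
    """
    num_objects = len(last_observed)
    prev_pred, prev_target = last_observed, last_observed
    total = 0.0
    for t, (pred_t, target_t) in enumerate(zip(pred, target)):
        step_error = 0.0
        for k in range(num_objects):
            (px, py), (tx, ty) = pred_t[k], target_t[k]
            # Position term: squared distance between predicted and true position.
            step_error += (px - tx) ** 2 + (py - ty) ** 2
            # Offset term (assumption): displacement from the previous timestep.
            pdx, pdy = px - prev_pred[k][0], py - prev_pred[k][1]
            tdx, tdy = tx - prev_target[k][0], ty - prev_target[k][1]
            step_error += (pdx - tdx) ** 2 + (pdy - tdy) ** 2
        # Average over objects, then down-weight later timesteps.
        total += (discount ** t) * step_error / num_objects
        prev_pred, prev_target = pred_t, target_t
    return total


# One ball moving right by 1 unit per step; the prediction stays at the origin.
target = [[(1.0, 0.0)], [(2.0, 0.0)]]
pred = [[(0.0, 0.0)], [(0.0, 0.0)]]
loss = trajectory_loss(pred, target, [(0.0, 0.0)], discount=0.5)
```

With `discount` below 1, errors at distant timesteps contribute less, so a model that is still inaccurate far into the future is not dominated by those terms.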
Results
- Datasets:
  - Simulation Billiards: three different colored balls with random velocity applied to one ball
  - Real World Billiards: “Three-cushion Billiards” videos from YouTube, bounding boxes obtained by fine-tuning pretrained ResNet-101 FPN
  - PHYRE: physical reasoning dataset, treat moving balls as objects and other static bodies as background
  - ShapeStacks: synthetic dataset of stacked objects (cubes, cylinders, or balls)
- Baselines:
  - Visual Interaction Network (VIN): directly assigns different channels of image features to objects, requires specifying fixed number of objects
  - Object Masking (OM): takes one image and $m$ object proposals, creating $m$ masked images, disregards any background information
  - Compositional Video Prediction (CVP): object feature obtained by feeding cropped object image patch into encoder, limited contextual information
- Metric:
  - Squared error between predicted and ground-truth positions averaged across objects and timesteps
- Dynamics Prediction:
  - OM does better than the other baselines on billiards and PHYRE because it explicitly models objects; all baselines work decently on ShapeStacks
  - RPIN does best on all datasets, combining explicit object modeling and context feature learning
  - Generalizes better than baselines to simulation billiard dataset with more balls, larger balls, ShapeStacks with more blocks, and new tasks in PHYRE
  - Dynamics prediction is poor when using only position features; adding local interaction constraints and position features to the visual features improves predictions
- RPIN outperforms baselines for planning (billiard target state, billiard hitting, PHYRE), although VIN is close for billiard tasks.
  - VIN cannot be applied to PHYRE, since there is a variable number of objects
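For reference, the evaluation metric listed above can be written out as follows. The symbols are assumptions introduced here for illustration: $K$ is the number of objects, $T$ the number of predicted timesteps (as in Methods), and $\hat{p}_t^k$ and $p_t^k$ the predicted and ground-truth positions of object $k$ at timestep $t$.

$$
\text{error} = \frac{1}{T K} \sum_{t=1}^{T} \sum_{k=1}^{K} \left\lVert \hat{p}_t^k - p_t^k \right\rVert_2^2
$$

Unlike the training objective, this average has no offset term and no discounting, so every timestep counts equally.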
Conclusion
- Main innovation is the use of ROI pooling to learn object-centric visual features with contextual information
- Requires bounding boxes as input for position labels and ROI proposals, limiting real world applicability
- Unclear how much context is actually needed to solve these tasks
- PHYRE artificially made to require more context by only considering moving balls as objects