Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Qi et al., 2020



  • Aims to build object representations that can capture inter-object and object-environment interactions over long time horizons
  • Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space
  • Improves prediction accuracy on the billiards, PHYRE, and ShapeStacks datasets, and the learned model can also be used for planning actions


  • Predicting changing dynamics in a visual scene is a core component of physical common sense
  • Approaches that treat this as an image translation problem and predict pixels directly, without imposing an object-centric representation, perform poorly
  • Other methods that operate on videos using object-centric representations have limitations such as lacking contextual reasoning or needing a predetermined number of objects
  • RPIN uses ROI pooling to extract object representations from raw frames while also incorporating information about the context (i.e. the environment)
  • Instead of predicting pixels, RPIN is trained end-to-end by minimizing distance between predicted and ground-truth object trajectories
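
To make the ROI pooling idea concrete, here is a minimal NumPy sketch of max-pooling a bounding-box region of a feature map into a fixed-size grid. It is a simplified illustration, not the authors' implementation: the function name `roi_max_pool`, the end-exclusive box convention, and the 2x2 output size are all assumptions. The point it shows is that boxes of different sizes map to features of the same shape, and that if the feature map comes from a CNN run over the whole frame, each pooled cell can already carry context from outside the box.

```python
import numpy as np


def roi_max_pool(feature_map, box, output_size=2):
    """Max-pool the part of `feature_map` covered by `box` into a fixed grid.

    feature_map: array of shape (C, H, W).
    box: (x0, y0, x1, y1) in feature-map coordinates, end-exclusive,
         assumed to lie inside the feature map (hypothetical convention).
    Returns an array of shape (C, output_size, output_size).
    """
    x0, y0, x1, y1 = box
    # Bin edges that split the box into an output_size x output_size grid.
    ys = np.linspace(y0, y1, output_size + 1)
    xs = np.linspace(x0, x1, output_size + 1)
    pooled = np.empty((feature_map.shape[0], output_size, output_size),
                      dtype=feature_map.dtype)
    for i in range(output_size):
        # Round edges outward so every bin covers at least one cell.
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)
        for j in range(output_size):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            pooled[:, i, j] = feature_map[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled


# Toy single-channel 4x4 "feature map" and two boxes of different sizes.
frame_features = np.arange(16, dtype=float).reshape(1, 4, 4)
whole = roi_max_pool(frame_features, (0, 0, 4, 4))
small = roi_max_pool(frame_features, (1, 1, 4, 3))
```

Both `whole` and `small` have shape `(1, 2, 2)`, so objects of any size can be fed to the same downstream layers.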


  • RPIN takes $N$ video frames as input and predicts the object locations for $T$ future timesteps.
    • First extract image features for each frame using an hourglass CNN, so that the features incorporate global context information
    • Apply FCN on top of ROI pooling to obtain object-centric visual features
    • Apply FCN to object position to get position features
    • The concatenated visual and position features of each object are fed into an interaction module, which updates the object features based on inter-object interactions
    • These updated object features are fed into a prediction module that predicts the object state representation at the next timestep
    • A final one-layer decoder estimates the spatial location of each object from the predicted object representation
  • The training objective is a combination of the $l_2$ distance between predicted and ground-truth positions and the $l_2$ distance between predicted and ground-truth relative positions (offset)
    • Also has a discounting factor to mitigate poor predictions in early training
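
The training objective above can be sketched as a discounted $l_2$ loss over predicted trajectories. This is a minimal illustration under stated assumptions, not the paper's exact loss: the function name `discounted_l2_loss`, the 4-dimensional per-object state (2-D position plus 2-D offset), the geometric weights `discount ** t`, and the normalization are all hypothetical choices, since the notes do not specify the form of the discounting.

```python
import numpy as np


def discounted_l2_loss(pred, target, discount=0.9):
    """Discounted squared l2 loss over a predicted rollout.

    pred, target: arrays of shape (T, K, D) for T future timesteps,
    K objects, and a D-dimensional state per object (assumed here to
    be position concatenated with offset).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    num_steps = pred.shape[0]
    # Assumption: timestep t (0-indexed) is weighted by discount ** t,
    # so later, harder-to-predict timesteps contribute less.
    weights = discount ** np.arange(num_steps)
    # Squared l2 distance per timestep and object, summed over state dims.
    sq_dist = ((pred - target) ** 2).sum(axis=-1)  # shape (T, K)
    per_step = sq_dist.mean(axis=1)                # shape (T,)
    return float((weights * per_step).sum() / num_steps)


# Toy rollout: 2 timesteps, 1 object, 4-D state, every entry off by 1.
toy_pred = np.zeros((2, 1, 4))
toy_target = np.ones((2, 1, 4))
toy_loss = discounted_l2_loss(toy_pred, toy_target, discount=0.5)
```

With `discount=1.0` the sketch reduces to a plain mean squared distance; smaller values shrink the contribution of later timesteps.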


  • Datasets:
    • Simulation Billiards: three different colored balls with random velocity applied to one ball
    • Real World Billiards: “Three-cushion Billiards” videos from YouTube, bounding boxes obtained by fine-tuning pretrained ResNet-101 FPN
    • PHYRE: physical reasoning dataset, treat moving balls as objects and other static bodies as background
    • ShapeStacks: synthetic dataset of stacked objects (cubes, cylinders, or balls)
  • Baselines:
    • Visual Interaction Network (VIN): directly assigns different channels of image features to objects, requires specifying fixed number of objects
    • Object Masking (OM): takes one image and $m$ object proposals, creating $m$ masked images, disregards any background information
    • Compositional Video Prediction (CVP): object feature obtained by feeding cropped object image patch into encoder, limited contextual information
  • Metric:
    • Squared error between predicted and ground-truth positions averaged across objects and timesteps
  • Dynamics Prediction:
    • OM does better than the other baselines on billiards and PHYRE because it explicitly models objects; all baselines work decently on ShapeStacks
    • RPIN does best on all datasets, combining explicit object modeling and context feature learning
  • RPIN generalizes better than the baselines to simulated billiards with more balls or larger balls, ShapeStacks with more blocks, and new tasks in PHYRE
  • Dynamics prediction is poor when using only position features; adding local interaction constraints and position features to the visual features improves predictions
  • RPIN outperforms the baselines for planning (billiard target state, billiard hitting, PHYRE), although VIN comes close on the billiard tasks
    • VIN cannot be applied to PHYRE, since PHYRE scenes contain a variable number of objects
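
For reference, the evaluation metric listed above (squared error between predicted and ground-truth positions, averaged across objects and timesteps) can be written as a short NumPy sketch. The function name `mean_squared_position_error` and the `(T, K, 2)` array layout are assumptions for illustration; any additional scaling used when reporting numbers is not covered by these notes and is not modeled here.

```python
import numpy as np


def mean_squared_position_error(pred, target):
    """Squared position error averaged across timesteps and objects.

    pred, target: arrays of shape (T, K, 2) holding (x, y) positions
    for T timesteps and K objects (hypothetical layout).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Squared Euclidean distance for each (timestep, object) pair.
    sq_err = ((pred - target) ** 2).sum(axis=-1)  # shape (T, K)
    return float(sq_err.mean())


# Toy check: every predicted position is off by (3, 4) from the target.
eval_pred = np.zeros((3, 2, 2))
eval_target = np.tile(np.array([3.0, 4.0]), (3, 2, 1))
eval_error = mean_squared_position_error(eval_pred, eval_target)
```

Unlike a discounted training loss, this metric weights all timesteps equally, so long-horizon errors count fully.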


  • Main innovation is the use of ROI pooling to learn object-centric visual features with contextual information
  • Requires bounding boxes as input, both as position labels and as ROI proposals, which limits real-world applicability
  • Unclear how much context is actually needed to solve these tasks
    • PHYRE is artificially made to require more context, since only the moving balls are treated as objects
Elias Z. Wang
AI Researcher | PhD Candidate