High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

Villegas et al., 2019



  • Addresses whether specialized architectures are needed for video prediction
  • Investigates how the performance of video prediction models changes as network capacity increases
  • Demonstrates that scaling up model size improves prediction accuracy
  • Links: [ website ] [ pdf ]


  • Learning accurate predictive models of the (visual) world remains a crucial, yet challenging, problem
  • Some prior work towards solving this problem uses multi-modal sensory streams, specialized computations (e.g. optical flow), or additional high-level information (e.g. landmarks, segmentations)
  • Increasing model capacity has commonly been shown to improve performance across various domains
    • Possibly because it better leverages the benefits of learning


  • Use Stochastic Video Generation (SVG) as base architecture, since it only uses standard neural network layers
    • Change the encoder-decoder to use only convolutional layers; the resulting larger bottleneck allows more detailed reconstruction
    • Use convolutional LSTM instead of fully-connected LSTM
    • Use $l_1$ loss instead of $l_2$ for sharper frame prediction
  • Scale up number of neurons in encoder-decoder by factor of $K$ and LSTM by factor of $M$
    • $M_{max}=3$, $K_{max}=5$
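
A minimal sketch of two of the ideas above, not the authors' implementation. The base widths, kernel size, pixel outcomes, and all function names are hypothetical and chosen only for illustration. The first part shows that multiplying every layer width by a factor grows the parameter count of a convolutional stack roughly quadratically. The second part shows one reason an $l_1$ loss can give sharper frames: for an uncertain pixel, the best constant prediction under $l_2$ is the mean (a blurry average), while under $l_1$ it is the median.

```python
def scale_widths(base_widths, factor):
    """Multiply every layer width (channel count) by an integer factor."""
    return [w * factor for w in base_widths]


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of a single 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def stack_params(widths, in_channels=3, kernel=3):
    """Total parameters of a plain stack of convolutions with the given widths."""
    total, prev = 0, in_channels
    for w in widths:
        total += conv_params(prev, w, kernel)
        prev = w
    return total


def l1_loss(pred, targets):
    """Mean absolute error of one constant prediction against several outcomes."""
    return sum(abs(pred - t) for t in targets) / len(targets)


def l2_loss(pred, targets):
    """Mean squared error of one constant prediction against several outcomes."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


def best_constant(loss_fn, targets, steps=100):
    """Grid-search the constant in [0, 1] that minimizes the given loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda c: loss_fn(c, targets))


# Hypothetical base encoder widths; the paper's actual widths are not given here.
BASE_WIDTHS = [32, 64, 128]
# Hypothetical pixel that is dark in two of three equally likely futures.
PIXEL_OUTCOMES = [0.0, 0.0, 1.0]

for factor in (1, 2, 3):
    widths = scale_widths(BASE_WIDTHS, factor)
    print(factor, widths, stack_params(widths))

print("best constant under l1:", best_constant(l1_loss, PIXEL_OUTCOMES))
print("best constant under l2:", best_constant(l2_loss, PIXEL_OUTCOMES))
```

The same width multiplier could in principle be applied separately to the encoder-decoder (factor $K$) and to the LSTM (factor $M$); how the paper wires this in is not shown here.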


  • Baselines:
    • LSTM: remove stochastic component
    • CNN: remove stochastic component and LSTM component
  • Datasets:
    • Action-conditioned towel pick: robot arm is interacting with towels
    • Human 3.6M: humans performing actions inside a room (e.g. walking, sitting)
    • KITTI: driving dataset with partial observability
  • Metrics:
    • Frame-wise:
      • Peak Signal-to-Noise Ratio (PSNR): measures exact pixel match
      • Structural Similarity (SSIM): measures similarity of local luminance, contrast, and structure rather than exact pixel match
      • VGG Cosine Similarity: measures perceptual similarity based on similarity of VGG features
    • Dynamics-based:
      • Fréchet Video Distance (FVD): compares quality of generated videos to ground-truth videos using features from a video classification network
      • Amazon Mechanical Turk (AMT): humans asked which of two videos was more realistic or if they looked about the same
  • Results summary:
    • Largest SVG model performs best with FVD for Towel pick and Human 3.6M, while largest LSTM performs slightly better for KITTI
    • Performance increases with model capacity for all models (SVG, LSTM, CNN)
    • CNN, without recurrence, does poorly overall
    • Humans prefer predictions from larger models, but they also seem to rate LSTM predictions as more realistic than SVG predictions
    • SVG and LSTM perform similarly based on frame-wise metrics, with SVG slightly outperforming for longer rollouts
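
As a rough illustration of two of the frame-wise metrics above, the sketch below computes PSNR from the mean squared error and a cosine similarity between two feature vectors. It is an assumption-laden toy: frames are flat lists of pixel values in [0, 1], the feature vectors stand in for VGG activations (no network is run), and the function names are hypothetical rather than taken from the paper's code.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-pixel frames: the prediction is off by 0.1 at every pixel.
target_frame = [0.2, 0.4, 0.6, 0.8]
predicted_frame = [0.3, 0.5, 0.7, 0.9]
print("PSNR (dB):", psnr(predicted_frame, target_frame))

# Hypothetical feature vectors standing in for VGG activations.
print("cosine similarity:", cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```

Higher is better for both quantities; PSNR penalizes any per-pixel deviation, while the cosine similarity only depends on the direction of the feature vectors.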


  • Verifies the intuition and general trend that larger capacity networks perform better, given sufficient training data
    • Possible to improve even more by training on higher resolution images
  • While they compare to ablations, it would have been interesting to see how architectures with additional inductive biases compare, especially with respect to number of parameters
  • The poor performance without recurrence might motivate the use of object-centric dynamics models, which seem to perform reasonably without recurrence
Elias Z. Wang
AI Researcher | PhD Candidate