Learning the Predictability of the Future

Suris et al., 2021

Summary

Learning from unlabeled video what is predictable in the future
Build predictive model in hyperbolic space, which naturally encodes hierarchical structure
- Automatically selects higher level of abstraction when uncertain
Emergence of action hierarchies in learned representations

Future prediction has been a core problem in computer vision
- How to figure out what to predict? e.g. pixels, activities, etc.
Most methods lack representations and objectives that adapt to the uncertainty in video
- Methods that represent uncertainty with stochastic latent inputs are compatible with proposed method
Proposed method learns from data which features are predictable, instead of committing up front to a level of abstraction to predict
- Jointly learns action hierarchy and correct level of abstraction within this hierarchy
- Hyperbolic space can be seen as the continuous analog of a tree, resulting in a hierarchy when representations are embedded in this space
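
To make the tree analogy concrete, here is the standard distance in the Poincaré ball model of hyperbolic space (unit ball, curvature -1). This is textbook background rather than something quoted from the paper; x and y are points with norm below 1.

```latex
d(x, y) = \operatorname{arcosh}\left( 1 + \frac{2 \lVert x - y \rVert^2}{(1 - \lVert x \rVert^2)(1 - \lVert y \rVert^2)} \right),
\qquad
d(0, x) = 2 \operatorname{artanh}(\lVert x \rVert) = \ln \frac{1 + \lVert x \rVert}{1 - \lVert x \rVert}
```

Distances blow up as points approach the boundary, so there is exponentially more room far from the origin. Points near the origin behave like nodes close to the root of a tree, and points near the boundary behave like leaves.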

Learn a video representation that is predictive of the future
- Predict a latent representation of the future, not pixels
Use the Poincaré ball model to define the distance between the predicted and observed representations
- Trained using contrastive loss with hyperbolic distance as similarity measure, negative examples from other videos
  - Contrastive loss to prevent trivial representations that result when directly regressing latent representations
- In hyperbolic space, mean of two embeddings is a parent embedding (i.e. higher level abstraction)
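
A minimal sketch of the two ideas above, in plain Python for readability rather than speed. It assumes the unit Poincaré ball (curvature -1) and an InfoNCE-style form for the contrastive loss. The `temperature` parameter and all function names are hypothetical and not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poincare_distance(x, y):
    """Geodesic distance in the unit Poincaré ball (curvature -1)."""
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - dot(x, x)) * (1.0 - dot(y, y))
    return math.acosh(1.0 + 2.0 * dot(diff, diff) / denom)

def mobius_add(x, y):
    """Möbius addition, the hyperbolic counterpart of vector addition."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * xy + xx * yy
    return [((1.0 + 2.0 * xy + yy) * a + (1.0 - xx) * b) / denom
            for a, b in zip(x, y)]

def hyperbolic_midpoint(x, y):
    """Point halfway along the geodesic from x to y."""
    v = mobius_add([-a for a in x], y)   # y as seen from x
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        return list(x)
    # Halve the hyperbolic length of v, then move back to x's frame.
    scale = math.tanh(0.5 * math.atanh(norm)) / norm
    return mobius_add(x, [scale * a for a in v])

def contrastive_loss(pred, positive, negatives, temperature=1.0):
    """InfoNCE-style loss (assumed form) using negative hyperbolic
    distance as the similarity. `negatives` would come from other videos."""
    candidates = [positive] + list(negatives)
    logits = [-poincare_distance(pred, z) / temperature for z in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Two "sibling" embeddings at the same radius, in different directions.
x, y = [0.6, 0.3], [0.6, -0.3]
mid = hyperbolic_midpoint(x, y)
print(math.sqrt(dot(mid, mid)) < math.sqrt(dot(x, x)))  # midpoint is nearer the origin
```

The midpoint sits closer to the origin than either input, which is the sense in which the mean of two embeddings acts as their parent in the hierarchy.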
Train classifier on top of learned representations
- Use hyperbolic multiclass logistic regression, since input representation is hyperbolic not Euclidean
Use standard (Euclidean) ResNet encoder, GRU, and MLP with a projection to hyperbolic space
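
The notes do not say which projection is used. One common choice, assumed here, is the exponential map at the origin of the Poincaré ball. The sketch below uses plain Python; the curvature parameter `c` and the clipping margin `eps` are illustrative assumptions.

```python
import math

def expmap0(v, c=1.0, eps=1e-5):
    """Map a Euclidean vector v into the Poincaré ball of curvature -c
    via the exponential map at the origin."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    # tanh squashes the norm into [0, 1/sqrt(c)); direction is preserved.
    new_norm = math.tanh(sqrt_c * norm) / sqrt_c
    # Assumed safeguard: keep points strictly inside the ball, since
    # tanh saturates to exactly 1.0 in floating point for large inputs.
    new_norm = min(new_norm, (1.0 - eps) / sqrt_c)
    return [a * new_norm / norm for a in v]

# Output of a (hypothetical) Euclidean MLP head, projected into the ball.
z = expmap0([3.0, -4.0])
```

Because only the norm is rescaled, the Euclidean encoder, GRU, and MLP can stay unchanged, and only the final output needs to be treated as hyperbolic.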

After learning representation from unlabeled videos, transfer to target domain using smaller, labeled dataset
- Fine-tune representations then train supervised linear classifier
Datasets:
- Sports Videos: Self-supervise on Kinetics-600 (600 human action classes, 500,000 videos) and evaluate on FineGym (gymnastic videos with three-level hierarchical action labels)
- Movies: Self-supervise on MovieNet (1,100 movies, 758,000 key frames) and evaluate on Hollywood2 (two-level action hierarchy)
Metrics:
- Accuracy: accuracy on leaf classes
- Bottom-up hierarchical accuracy: partially correct if incorrect at leaf level but correct at higher levels (50% decay per level as you go up)
- Top-down hierarchical accuracy: same idea, but the reward decays by 50% per level as you go down the hierarchy
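
The notes do not give the exact formula for the hierarchical metrics. The sketch below shows one plausible reading: a normalized weighted average of per-level correctness, with weights halving per level in the stated direction. The function name and the example labels are hypothetical.

```python
def hierarchical_accuracy(pred_path, true_path, bottom_up=True, decay=0.5):
    """Score one prediction against the ground truth.

    pred_path / true_path: labels from root to leaf, same length.
    bottom_up=True  -> leaf has weight 1, weights decay going up.
    bottom_up=False -> root has weight 1, weights decay going down.
    """
    depth = len(true_path)
    if bottom_up:
        weights = [decay ** (depth - 1 - level) for level in range(depth)]
    else:
        weights = [decay ** level for level in range(depth)]
    earned = sum(w for w, p, t in zip(weights, pred_path, true_path) if p == t)
    return earned / sum(weights)

# Three-level example: correct at the top two levels, wrong at the leaf.
true = ["floor exercise", "leap", "split leap"]
pred = ["floor exercise", "leap", "switch leap"]
print(round(hierarchical_accuracy(pred, true, bottom_up=True), 3))   # 0.429
print(round(hierarchical_accuracy(pred, true, bottom_up=False), 3))  # 0.857
```

Under this reading, the same mistake at the leaf is penalized heavily by the bottom-up metric and lightly by the top-down one.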
Early action recognition: classify actions that have started but not finished
- Hyperbolic representations outperform Euclidean, with a larger gap when embedding dimension is smaller
Future action prediction: predict actions before they start, given past context
- The dimension trend seen in early action recognition is reversed here, which undercuts the evidence that hierarchical (hyperbolic) representations are more compact

Not sure how to interpret future action prediction results since there’s no ground truth as to which level to predict when
Datasets don’t really distinguish between hierarchy of actions in time versus abstraction
- Higher-level actions in the hierarchy are both longer in duration and more abstract (e.g. “human interaction” > “handshake”)
Only applied to predicting actions, not lower-level object motions/dynamics
Regardless of the actual results for this specific model, the idea of using hyperbolic embeddings for hierarchical representations is interesting