Occlusion resistant learning of intuitive physics from videos
Riochet et al., 2020
Summary
- Propose a probabilistic formulation for learning intuitive physics in 3D scenes with significant occlusion
- Object proposals, from a pretrained network or ground-truth segmentations, are linked across frames using a combination of a recurrent interaction network, object-centric dynamics model, and compositional renderer
- Tested on intuitive physics benchmark, synthetic dataset with heavy occlusion, and real videos
- Links: [ website ] [ pdf ]
Background
- Developing general-purpose methods that can make physical predictions in noisy environments is difficult, especially since objects can be fully occluded by other objects for significant durations
- Video prediction methods, which operate in pixel space, generally fail to preserve object properties and produce blurry outputs, especially as the number of objects increases
- Previous methods that combine object-centric dynamics models with visual encoders have not focused on 3D scenes with significant occlusion
- The proposed method also predicts the plausibility of an observed dynamic scene and infers object velocities as latent variables, allowing trajectory prediction despite occlusion
Methods
- Event decoding:
- Assign a sequence of underlying object states to a sequence of video frames, $\hat{S} = \arg\max_S P(S|F,\theta)$
- Can be decomposed into a rendering model, $P(F|S,\theta)$, and a physical model, $P(S|\theta)$
- The problem is simplified by:
- operating in mask space instead of pixel space, using a pretrained instance mask detector
- expressing state space in 2.5D instead of 3D, bypassing the need to learn inverse projective geometry
- implementing probabilistic models as neural networks
- estimating the optimal state, $\hat{S}$, using a combination of a pixel-wise rendering loss and an $l_2$ physics loss
- Scene graph proposals give initial object states, which are linked across time using RecIntNet and a nearest-neighbor strategy; this initial scene interpretation is then refined by minimizing the total loss through both RecIntNet and the Renderer over the entire object state sequence (a rough sketch follows this list)
- The inverse of the total loss is used as a plausibility score
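- A minimal sketch of this refinement step, assuming PyTorch and hypothetical `renderer` and `dynamics` callables standing in for the Renderer and RecIntNet; the total loss is a stand-in for the negative log of $P(F|S,\theta)P(S|\theta)$, and the loss weighting and optimizer are assumptions rather than the paper's specification

```python
import torch

def total_loss(states, observed_masks, renderer, dynamics, lam=1.0):
    """Surrogate for the negative log-posterior: a pixel-wise rendering loss
    plus an l2 physics loss on the predicted next states.
    `states` is a (T, N, D) tensor of per-object 2.5D states."""
    render_err = 0.0
    for t in range(states.shape[0]):
        # Rendering term: rendered masks should match the observed instance masks.
        render_err = render_err + ((renderer(states[t]) - observed_masks[t]) ** 2).mean()

    phys_err = 0.0
    for t in range(states.shape[0] - 1):
        # Physics term: consecutive states should be consistent with the dynamics model.
        phys_err = phys_err + ((dynamics(states[t]) - states[t + 1]) ** 2).mean()

    return render_err + lam * phys_err

def refine_states(init_states, observed_masks, renderer, dynamics, steps=100, lr=1e-2):
    """Gradient-based refinement of the initial scene interpretation;
    the final loss doubles as an (inverse) plausibility score."""
    states = init_states.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([states], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total_loss(states, observed_masks, renderer, dynamics).backward()
        opt.step()
    final = total_loss(states, observed_masks, renderer, dynamics).item()
    return states.detach(), 1.0 / (final + 1e-8)  # higher = more plausible
```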
- Renderer:
- Predicts a segmentation mask given a list of properties (e.g. $x$ and $y$ position, depth, type, size) for $N$ objects
- Can take a variable number of objects as input and is invariant to object order
- Object rendering network: reconstructs a segmentation mask and depth map for each object
- Occlusion predictor: composes the $N$ predicted object masks using the predicted depth maps (sketched below)
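- A rough sketch of the two-stage composition, assuming each object state is a flat property vector and that occlusion can be approximated by a soft depth-ordered weighting; the MLP decoder below is a hypothetical stand-in for the paper's object rendering network

```python
import torch
import torch.nn as nn

class ObjectRenderer(nn.Module):
    """Maps one object's properties (x, y, depth, type, size, ...) to a
    single-object segmentation mask and depth map (hypothetical MLP decoder)."""
    def __init__(self, state_dim, hw=64):
        super().__init__()
        self.hw = hw
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * hw * hw),  # mask logits + depth values per pixel
        )

    def forward(self, states):                     # states: (N, state_dim)
        out = self.net(states).view(-1, 2, self.hw, self.hw)
        masks = torch.sigmoid(out[:, 0])           # (N, H, W) per-object masks
        depths = out[:, 1]                         # (N, H, W) per-object depth maps
        return masks, depths

def compose(masks, depths, sharpness=10.0):
    """Occlusion predictor: combine the N object masks with a soft argmin
    over depth, so that closer objects occlude farther ones."""
    visibility = torch.softmax(-sharpness * depths, dim=0)   # per-pixel visibility weights
    return (visibility * masks).sum(dim=0)                   # composed scene mask (H, W)
```

- Because the per-object network is applied independently to each state vector and the composition sums over objects, this kind of renderer naturally handles a variable number of objects and is order-invariant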
- RecIntNet:
- Extends Interaction Networks (Battaglia et al., 2016) in three ways:
- Model 2.5D scenes by adding depth component
- Change to recurrent network, to learn from multiple future frames by “rolling-out” during training
- Directly predict changes in velocity
- Latent object properties unchanged
- Introduce variance in the position predictions, assuming object positions follow a multivariate normal distribution with diagonal covariance
- Weight the physics loss by the estimated noise level (a toy sketch follows this list)
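- A toy stand-in for these extensions, assuming PyTorch, a fully connected pairwise interaction term, and a unit time step; the variance head and variance-weighted loss illustrate the diagonal-Gaussian assumption rather than the paper's exact architecture

```python
import torch
import torch.nn as nn

class RecIntNetSketch(nn.Module):
    """Toy recurrent interaction network: pairwise effects produce a per-object
    change in velocity plus a diagonal log-variance over position."""
    def __init__(self, state_dim=6, hidden=128):
        super().__init__()
        self.pair = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.head = nn.Sequential(nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 6))   # 3 velocity deltas + 3 log-variances

    def step(self, pos, vel):                  # pos, vel: (N, 3), depth as the third coordinate
        state = torch.cat([pos, vel], dim=-1)  # (N, 6)
        n = state.shape[0]
        # Sum pairwise effects on each object, as in Interaction Networks.
        pairs = torch.cat([state.unsqueeze(1).expand(n, n, -1),
                           state.unsqueeze(0).expand(n, n, -1)], dim=-1)
        effects = self.pair(pairs).sum(dim=1)  # (N, hidden)
        out = self.head(torch.cat([state, effects], dim=-1))
        dvel, logvar = out[:, :3], out[:, 3:]
        vel = vel + dvel                       # predict the *change* in velocity
        pos = pos + vel                        # integrate position (unit time step)
        return pos, vel, logvar

    def rollout(self, pos, vel, horizon):
        """Recurrent roll-out over multiple future frames during training."""
        preds = []
        for _ in range(horizon):
            pos, vel, logvar = self.step(pos, vel)
            preds.append((pos, logvar))
        return preds

def weighted_physics_loss(preds, targets):
    """Gaussian negative log-likelihood with diagonal covariance: position errors
    are down-weighted where the model predicts high positional uncertainty."""
    loss = 0.0
    for (pos, logvar), target in zip(preds, targets):
        loss = loss + (((pos - target) ** 2) * torch.exp(-logvar) + logvar).mean()
    return loss
```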
Results
- Datasets:
- IntPhys: videos of possible and impossible events, split into three blocks where objects may: disappear (O1), change shape (O2), and “teleport” (O3)
- Half of the impossible events occur in plain view, while the other half occurs under full occlusion
- Synthetic dataset with videos of balls of different colors bouncing (on the ground) in a large box
- Five views: top view (90$^\circ$), top view + occ (a moving object occluding 25%), 45$^\circ$, 25$^\circ$, and 15$^\circ$
- Real videos from Kinect2 with setup similar to top view of synthetic dataset
- Results are similar to previous SotA on IntPhys for visible scenarios, but much better on the occluded scenarios of the O1 and O2 blocks
- Outperforms simple baselines (Linear, MLP, NoDyn, NoProba) on $l_2$ error at 5- and 10-frame prediction horizons
- Performance decreases noticeably as the camera angle decreases, i.e., as inter-object occlusion increases
- Generalizes reasonably well to real videos without any fine-tuning
Conclusion
- Requires ground-truth object positions and segmentations, limiting its real-world application
- Using a dynamics model to link observations and incorporating uncertainty are interesting
- Overall the datasets seem to be relatively simple, without any true 3D dynamics