Entity Abstraction in Visual Model-Based Reinforcement Learning

Veerapaneni et al., 2019

Elias Z. Wang

Published Aug 19, 2020 Artificial Intelligence

Source: Veerapaneni et al., 2019

Summary

Presents a system for model-based reinforcement learning which learns from raw visual inputs
Object-centric perception, prediction, and planning (OP3) uses a fully probabilistic entity-centric dynamic latent variable framework
OP3 generalizes to novel block stacking configurations with more objects than observed in training

Background

Modeling scenes by modeling objects and the local processes that govern their interactions may provide benefits to generalization
Some limitations of other approaches:
- assume a fixed number of entities
- do not constrain instances of the same entity to be modeled the same way
- require entity identities as additional inputs, or other related supervision
- do not model whole entities coherently
- cannot update entity representations with new observations

Methods

Define a probabilistic model in which a set of latent variables represents the state of objects in the scene, in addition to random variables for the image observation and agent actions
Approximate the posterior predictive distribution of observations $d$ steps into the future by learning the parameters of an approximate observation distribution, dynamics distribution, and time-factorized recognition distribution with variational inference
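As a rough sketch of the quantity being approximated, write $x$ for observations, $a$ for actions, $h$ for the set of entity latents, and $T$ for the last observed step. The symbols $G$, $D$, and $Q$ for the observation, dynamics, and recognition distributions are labels assumed here for illustration, and may not match the paper's exact notation:

$$
p\left(x^{(T+1:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right) \approx \int Q\left(h^{(0:T)} \mid x^{(0:T)}, a^{(0:T-1)}\right) \prod_{t=T+1}^{T+d} D\left(h^{(t)} \mid h^{(t-1)}, a^{(t-1)}\right) G\left(x^{(t)} \mid h^{(t)}\right) \, dh^{(0:T+d)}
$$

Under one plausible reading of "time-factorized", the recognition distribution would decompose as $Q\left(h^{(0:T)} \mid \cdot\right) = \prod_{t=0}^{T} Q\left(h^{(t)} \mid h^{(t-1)}, x^{(t)}, a^{(t-1)}\right)$, with the $t = 0$ factor conditioned on $x^{(0)}$ alone.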
Entity abstraction derived from assumption about symmetry, which reduces dynamic scene modeling to two steps:
1. Model single entity and its interactions with entity-centric function
2. Apply this function over each entity in the scene
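The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `apply_entity_wise` and `drift_toward_neighbors` are hypothetical names, entities are plain lists of floats, and the toy entity function stands in for a learned network. The point is that one shared function handles scenes of any size.

```python
from typing import Callable, List, Sequence

Entity = List[float]  # hypothetical latent vector for a single entity


def apply_entity_wise(
    entity_fn: Callable[[Entity, Sequence[Entity]], Entity],
    entities: Sequence[Entity],
) -> List[Entity]:
    """Step 2: apply one shared entity-centric function to every entity."""
    updated = []
    for k, focus in enumerate(entities):
        # The remaining entities form the context the focus entity interacts with.
        context = [e for j, e in enumerate(entities) if j != k]
        updated.append(entity_fn(focus, context))
    return updated


def drift_toward_neighbors(focus: Entity, context: Sequence[Entity]) -> Entity:
    """Step 1 (toy stand-in): move the focus 10% toward the mean of the others."""
    if not context:
        return list(focus)
    dims = range(len(focus))
    mean = [sum(c[d] for c in context) / len(context) for d in dims]
    return [f + 0.1 * (m - f) for f, m in zip(focus, mean)]


# The same function is reused, unchanged, for scenes with 2 and 4 entities.
scene_small = [[0.0, 0.0], [1.0, 0.0]]
scene_large = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
next_small = apply_entity_wise(drift_toward_neighbors, scene_small)
next_large = apply_entity_wise(drift_toward_neighbors, scene_large)
```

Because nothing in `apply_entity_wise` depends on the number or order of entities, reordering the input simply reorders the output, which is the symmetry the notes refer to.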
Observation Model: approximates the distribution of the observation given the entities
- Implemented using a mixture model at each pixel, where each mixture component models the observation for an individual entity
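A minimal sketch of a per-pixel mixture, for a single grayscale pixel. Several details are assumptions made for illustration rather than taken from the paper: mixture weights come from a softmax over per-entity mask logits, each component is a Gaussian with a fixed standard deviation, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize per-entity mask logits into mixture weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def pixel_likelihood(
    x: float, mask_logits: List[float], entity_means: List[float], std: float = 0.1
) -> float:
    """Mixture likelihood of one pixel value: sum_k weight_k * N(x; mean_k, std)."""
    weights = softmax(mask_logits)
    return sum(w * gaussian_pdf(x, mu, std) for w, mu in zip(weights, entity_means))


def pixel_mean(mask_logits: List[float], entity_means: List[float]) -> float:
    """Expected pixel value: each entity's prediction blended by its weight."""
    weights = softmax(mask_logits)
    return sum(w * mu for w, mu in zip(weights, entity_means))


# Two entities compete for one pixel; the first has the far larger mask logit,
# so the pixel is explained almost entirely by the first entity's prediction.
mask_logits = [5.0, -5.0]
entity_means = [0.9, 0.1]
blended = pixel_mean(mask_logits, entity_means)
```

The mask weights act as a soft segmentation: each pixel is attributed mostly to whichever entity claims it most strongly.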
Dynamics Model: approximates the distribution of future entity latents given their current values and an action
- Combinatorially large space of object configurations and interactions
- Reduce the problem to modeling the effect on a single entity given the action and the entity’s interactions with the other entities
- Further reduce problem by enforcing pairwise interactions
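The reduction to pairwise interactions can be illustrated with a toy 1-D example. Everything here is a hypothetical stand-in for learned components: entity states are scalar positions, the action is an assumed `(push location, displacement)` pair, and the hand-written `pairwise_effect` replaces a learned interaction function.

```python
from typing import List, Sequence, Tuple

Action = Tuple[float, float]  # hypothetical: (push location, push displacement)


def action_effect(state: float, action: Action) -> float:
    """Effect of the action on one entity: it moves only if the push is nearby."""
    location, displacement = action
    return displacement if abs(state - location) < 0.5 else 0.0


def pairwise_effect(focus: float, other: float) -> float:
    """Effect of one other entity on the focus: nearby entities repel slightly."""
    gap = focus - other
    if gap == 0.0 or abs(gap) >= 1.0:
        return 0.0
    return 0.1 if gap > 0 else -0.1


def step_entity(focus: float, others: Sequence[float], action: Action) -> float:
    """Next state of one entity from the action plus summed pairwise effects."""
    interaction = sum(pairwise_effect(focus, other) for other in others)
    return focus + action_effect(focus, action) + interaction


def step_scene(states: Sequence[float], action: Action) -> List[float]:
    """Apply the same single-entity update to every entity in the scene."""
    return [
        step_entity(s, [o for j, o in enumerate(states) if j != k], action)
        for k, s in enumerate(states)
    ]


# Entity 0 is pushed right and repelled by entity 1; entity 2 is too far to matter.
next_states = step_scene([0.0, 0.6, 3.0], action=(0.0, 0.2))
```

For $K$ entities this evaluates the pairwise function $K(K-1)$ times per step, instead of modeling every joint configuration of objects directly.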
Solve variable binding by inferring parameters of the posterior distribution of entity variables given a sequence of interactions
- Decomposed into recognition distribution applied to each entity
- Iterative approach to break symmetry for dividing responsibility of modeling different objects in the scene
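To give some intuition for why an iterative procedure with broken symmetry is needed, the sketch below swaps in plain expectation-maximization on a 1-D Gaussian mixture in place of OP3's learned inference. It is an analogy only. The assumption that symmetry is broken by random initialization, the fixed standard deviation, and the name `refine_entities` are all choices made for this illustration.

```python
import math
import random
from typing import List, Optional


def refine_entities(
    points: List[float],
    num_entities: int,
    num_steps: int = 20,
    std: float = 0.5,
    seed: int = 0,
    initial_means: Optional[List[float]] = None,
) -> List[float]:
    """Iteratively refine one scalar 'entity' estimate per mixture component."""
    rng = random.Random(seed)
    low, high = min(points), max(points)
    # Random initialization makes the otherwise interchangeable entities differ.
    means = (
        list(initial_means)
        if initial_means is not None
        else [rng.uniform(low, high) for _ in range(num_entities)]
    )
    for _ in range(num_steps):
        # Soft assignment: how responsible is each entity for each point?
        responsibilities = []
        for x in points:
            logits = [-0.5 * ((x - m) / std) ** 2 for m in means]
            peak = max(logits)
            weights = [math.exp(v - peak) for v in logits]
            total = sum(weights)
            responsibilities.append([w / total for w in weights])
        # Refinement: move each entity toward the points it is responsible for.
        for k in range(num_entities):
            mass = sum(r[k] for r in responsibilities)
            if mass > 1e-12:
                means[k] = sum(r[k] * x for r, x in zip(responsibilities, points)) / mass
    return means


observations = [-0.1, 0.0, 0.1, 4.9, 5.0, 5.1]  # two well-separated "objects"
separated = refine_entities(observations, num_entities=2)
tied = refine_entities(observations, num_entities=2, initial_means=[2.5, 2.5])
```

With random starting points the two estimates split the data between them, one per cluster. With identical starting points (`tied`) both estimates receive identical updates at every step and never divide responsibility, which is the failure mode that symmetry breaking avoids.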

Results

OP3 generalizes to solve block stacking (82%) while only being trained to predict how objects fall, more than 3x the accuracy of SAVP (24%)
- Setup: In MuJoCo, block is raised in the air and model must predict steady-state effects of dropping it on surface with multiple objects
  - Up to five objects in training, up to nine in testing
  - Tested on constructing block tower specified by goal image
  - Actions chosen based on greedy approach, assuming single actions are sufficient to move an object to its goal position
- Metrics: Accuracy based on checking that all blocks are within some tolerance of goal
- Baselines:
  - SAVP: SOTA model for video prediction, does not process entities symmetrically
  - O2P2: processes entities symmetrically, but requires object segmentations
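The greedy action selection and the tolerance-based accuracy check described in the setup above can be sketched as follows. This is a toy under loud assumptions: `predict_drop` is a hand-written stand-in for the learned model, blocks are 2-D points compared with their goals in a fixed order, and the candidate drop positions and the tolerance are arbitrary.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, height) of a block


def predict_drop(scene: Sequence[Point], drop_x: float) -> List[Point]:
    """Stand-in for a learned model: the block lands on top of the stack at drop_x."""
    height = sum(1 for (x, _) in scene if abs(x - drop_x) < 0.5)
    return list(scene) + [(drop_x, float(height))]


def plan_greedily(goals: Sequence[Point], candidate_xs: Sequence[float]) -> List[Point]:
    """Place one block per goal, picking the single action with the best prediction."""
    scene: List[Point] = []
    for goal in goals:
        # Greedy: assume a single drop suffices to put this block at its goal.
        best_x = min(
            candidate_xs,
            key=lambda x: math.dist(predict_drop(scene, x)[-1], goal),
        )
        scene = predict_drop(scene, best_x)
    return scene


def all_blocks_within_tolerance(
    blocks: Sequence[Point], goals: Sequence[Point], tolerance: float
) -> bool:
    """A trial counts as a success only if every block is close to its goal."""
    return len(blocks) == len(goals) and all(
        math.dist(b, g) <= tolerance for b, g in zip(blocks, goals)
    )


goals = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0)]  # a two-block tower plus one loose block
final_scene = plan_greedily(goals, candidate_xs=[-1.0, 0.0, 1.0, 2.0])
success = all_blocks_within_tolerance(final_scene, goals, tolerance=0.05)
```

Accuracy is then the fraction of trials for which the success check passes. A planner this greedy cannot handle cases where a block must first be moved out of the way.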
OP3 achieves 2.5x accuracy of SAVP on manipulating objects already present in the scene into goal position
- Setup: Initial scene contains all blocks needed
  - Action space represents picking and dropping locations; a pick only succeeds if the action intersects with a block outline
  - Moving objects out of place may be necessary, multi-step planning
Compared to IODINE applied to single frames, OP3 uses temporal information to better segment objects in real-world videos of a robotic arm moving deformable objects

Conclusion

Symmetric modeling of entities provides clear benefits for generalization in block manipulation tasks in simulation
Ability to continually update latent entity representations with new visual observations is another advantage
Limited demonstration of planning and real-world capabilities

Tags: 2019, computer vision, paper review, reinforcement learning, Stanford, MIT, UC Berkeley, compositionality, CoRL, physics, world model

