Control What You Can: Intrinsically Motivated Task-Planning Agent

Blaes et al., 2019

Elias Z. Wang

Published Aug 1, 2020 Artificial Intelligence


Summary

Addresses how to make an agent learn efficiently to control its environment with minimal external reward
Proposes method that combines task-level planning with intrinsic motivation
Improved performance compared to intrinsically motivated, non-hierarchical and hierarchical baselines in synthetic and robotic manipulation environments

Background

Babies seemingly conduct experiments on the world and analyze the statistics of their observations to form an understanding of their world
Undirected “play” behavior is also commonly observed, which can be viewed as trying to gain control
- One example is learning to use tools to increase what is controllable

Methods

Assume the observable state space is partitioned into potentially controllable components (goal spaces); manipulation of these components is formulated as tasks
The perception problem of constructing the goal spaces from sensor modalities (e.g. image data) is not considered
Control What You Can (CWYC) approach:
- Tasks defined by components of the state space
- Task selector, implemented as a multi-armed bandit, selects (final) tasks that maximize expected learning progress
- $\epsilon$-greedy task planner computes a sub-task sequence from a learned task graph, which captures how quickly a sub-task can be solved when another sub-task is performed directly before
- Sub-goal generator, implemented with relational attention networks, creates goals in the current sub-task to maximize success in the subsequent task
- Goal-conditioned task-specific low-level policies control the agent (SAC or DDPG+HER)
- Training results (success rate, progress, surprise) are stored in a history buffer
- Intrinsic motivation module computes rewards for task selector, task planner, and sub-goal generator
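To make the task planner bullet above concrete, here is a minimal sketch of $\epsilon$-greedy planning over a task graph. It is not the paper's implementation: the `cost` dictionary, the `START` marker, the task names, and the greedy backward chaining (rather than a full shortest-path search) are all assumptions chosen for illustration.

```python
import random

START = "start"  # hypothetical marker: no sub-task has been performed yet


def plan_subtasks(cost, final_task, epsilon=0.1, rng=random):
    """Build a sub-task sequence ending in final_task by backward chaining.

    cost[prev][task] is an assumed estimate of how long `task` takes to solve
    when `prev` was performed directly before it (lower is better).
    """
    plan = [final_task]
    current = final_task
    while True:
        # Predecessors with a known edge into `current` that are not already planned.
        candidates = [p for p in cost if current in cost[p] and p not in plan]
        if not candidates:
            break  # nothing known about this task: attempt it directly
        if rng.random() < epsilon:
            prev = rng.choice(candidates)  # explore a random predecessor
        else:
            prev = min(candidates, key=lambda p: cost[p][current])  # exploit
        if prev == START:
            break  # the task is best attempted with nothing before it
        plan.append(prev)
        current = prev
    plan.reverse()
    return plan


# Illustrative (made-up) graph: the heavy object is fastest to solve after the tool.
cost = {
    START: {"locomotion": 1.0, "tool": 50.0, "heavy_object": 100.0},
    "locomotion": {"tool": 2.0, "heavy_object": 80.0},
    "tool": {"heavy_object": 3.0},
}
print(plan_subtasks(cost, "heavy_object", epsilon=0.0))
# → ['locomotion', 'tool', 'heavy_object']
```

With `epsilon > 0` the planner occasionally tries a different predecessor, which is how new edges of the graph would get evaluated in this sketch.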
Learning progress is defined as the time derivative of the success rate, with success defined as reaching a goal state within some tolerance
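A rough sketch of how this quantity could be estimated and fed to a bandit-style task selector follows. The finite-difference window, the Euclidean tolerance check, the use of the absolute value of progress, and all names (`is_success`, `ProgressTracker`, `select_task`) are assumptions for illustration, not details taken from the paper.

```python
import random
from collections import deque


def is_success(achieved, goal, tolerance=0.05):
    """Success: the achieved state lies within `tolerance` of the goal (Euclidean)."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return dist <= tolerance


class ProgressTracker:
    """Keeps recent outcomes for one task and estimates its learning progress."""

    def __init__(self, window=10):
        self.window = window
        self.outcomes = deque(maxlen=2 * window)

    def record(self, success):
        self.outcomes.append(1.0 if success else 0.0)

    def learning_progress(self):
        """Finite-difference estimate of d(success rate)/dt over two windows."""
        if len(self.outcomes) < 2 * self.window:
            return 0.0  # not enough attempts yet
        values = list(self.outcomes)
        older = sum(values[: self.window]) / self.window
        recent = sum(values[self.window:]) / self.window
        return (recent - older) / self.window


def select_task(trackers, rng=random):
    """Sample a task with probability proportional to |learning progress|."""
    names = list(trackers)
    weights = [abs(trackers[n].learning_progress()) for n in names]
    if sum(weights) == 0:
        return rng.choice(names)  # no signal yet: pick uniformly
    return rng.choices(names, weights=weights, k=1)[0]
```

In this sketch, tasks that are already solved or still unsolvable both have a flat success rate, so they receive little weight and the selector concentrates on tasks whose success rate is currently changing.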
Use thresholded prediction error to bootstrap early learning in task selector
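One plausible reading of "thresholded prediction error" is a binary surprise signal that fires when a forward model's error is unusually large compared with its recent history. The sketch below assumes a threshold of mean plus `k` standard deviations; that rule, the default `k`, and the function name are illustrative assumptions rather than the paper's exact definition.

```python
import statistics


def surprise_signal(error, past_errors, k=3.0):
    """Return 1 if `error` is unusually large compared with `past_errors`, else 0."""
    if len(past_errors) < 2:
        return 0  # too little history to estimate a spread
    threshold = statistics.mean(past_errors) + k * statistics.stdev(past_errors)
    return 1 if error > threshold else 0


# Made-up prediction errors: a stable history followed by one large deviation,
# e.g. the agent unexpectedly bumping into an object.
history = [0.10, 0.10, 0.12, 0.09, 0.11]
print(surprise_signal(1.0, history), surprise_signal(0.1, history))
```

Because the signal is binary and sparse, it can mark rare events as interesting before any success has been recorded, which is what makes it useful for bootstrapping.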

Results

Environments:
- Synthetic environment: 2-DOF point mass agent with several objects in an enclosed area, implemented in MuJoCo, continuous state and action spaces
  - Contains four objects: tool, heavy object, unreliable (50%) object, and random object
  - Arena is large so random encounters are unlikely
- Robotic manipulation: robotic arm with gripper (3+1 DOF) in front of a table with a hook and a box at random, out-of-reach locations; the arm needs to use the hook
  - Goal spaces defined as: reaching target position with gripper, manipulating the hook, manipulating the box
  - Object relations are less obvious, but random manipulations are more frequent
Metrics:
- Success rate: overall success of reaching random goal in each task space
Baselines:
- Hierarchical reinforcement learning with off-policy correction (HIRO): solves each task independently
- Intrinsic curiosity module with surprise (ICM-S)
- Intrinsic curiosity module with raw prediction error (ICM-E)
- SAC: low-level controller
- DDPG+HER: low-level controller
- CWYC w oracle: oracle task planner and sub-goal generator
In synthetic environment, methods that treat each task independently (SAC, ICM-(S/E), HIRO) can only solve locomotion, while CWYC solves all three tasks: locomotion, tool use, and heavy object manipulation
- Task selector learns curriculum order of locomotion, tool, heavy object, then finally 50% object
In the robotic manipulation environment, only CWYC and DDPG+HER can solve all three tasks
- CWYC has slightly better sample complexity compared to DDPG+HER
- Demonstrates that surprise (e.g. unintentionally hitting the tool) helps identify funnel states with ~30 positive samples

Conclusion

The success of DDPG+HER in the robotic arm environment makes the results less convincing, and this baseline was not used in the synthetic environment
State space is relatively low-dimensional and goals are only 2-D, $(x, y)$
Parameterization of sub-goal generator seems particularly suited for the specific context
Prior information encoded in task/state space limits its practical use

2019 paper review reinforcement learning intrinsic motivation NeurIPS planning Max-Planck Institute for Intelligent Systems

