Automatic Goal Generation for Reinforcement Learning Agents
Florensa et al., 2018
Summary
- Curriculum learning applied to RL context
- Generates target goals during training using a GAN, ensuring the goals are of ‘intermediate’ difficulty
- Experiments with Ant on plane and in maze
- Links: [ website ] [ pdf ]
Background
- Most of the recent successes in RL are in settings where the agent is trained to optimize a single reward function for a single task.
- Want to create agents that can perform a variety of tasks
- Maximize average success rate over all possible goals
- Graves et al., 2017 focus on the supervised setting; their learning progress signals can’t be applied to an RL setting with sparse rewards
Methods
- Define a goal-parametrized reward function that indicates whether the agent reaches the goal within T steps (formalized after this list)
- For a given policy, the expected return for a goal is then the probability of reaching it
- This captures the property that it may be hard to tell whether the agent is getting closer to a goal, but easy to tell whether it has been achieved
- Overall objective: find a policy that maximizes the average probability of success over goals sampled from a (test) goal distribution
- First, label a goal as being of intermediate difficulty if the expected return for that goal under the current policy lies between some min and max value (e.g. 0.1-0.9); see the labeling sketch below
- Second, generate new goals of intermediate difficulty using a GAN
- Modified so the discriminator is also trained on “negative examples” (goals outside the intermediate band); see the loss sketch below
- Update the policy on these goals, and repeat (the full loop is sketched at the end of this section)
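A compact formalization of the above, written from these bullets rather than quoted from the paper ($f(s)$ maps states to goal space, $\epsilon$ is the success tolerance, and $p_g$ is the test goal distribution):

```latex
% Sparse indicator reward: 1 iff the state is within epsilon of goal g
r^g(s_t) = \mathbb{1}\{ s_t \in S^g \}, \qquad
S^g = \{ s : \lVert f(s) - g \rVert \le \epsilon \}

% Expected return = probability of reaching g within T steps
R^g(\pi) = \mathbb{E}\Big[\max_{t \le T} r^g(s_t) \,\Big|\, \pi, g\Big]
         = \Pr\big(\exists\, t \le T : s_t \in S^g \mid \pi, g\big)

% Objective: maximize the average success probability over the goal distribution
\pi^* = \arg\max_{\pi}\; \mathbb{E}_{g \sim p_g}\big[R^g(\pi)\big]

% Goals of intermediate difficulty (GOID) for the current policy \pi_i
GOID_i = \{\, g : R_{\min} \le R^g(\pi_i) \le R_{\max} \,\}
```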
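A minimal sketch of the labeling step, assuming the per-goal success rates have already been estimated from rollouts of the current policy (all names here are illustrative, not from the paper’s code):

```python
# Label goals as "intermediate difficulty": success rate within [R_MIN, R_MAX].
R_MIN, R_MAX = 0.1, 0.9  # the band from the bullet above

def label_goals(success_rates, r_min=R_MIN, r_max=R_MAX):
    """Return 1 iff a goal's empirical success rate lies in [r_min, r_max]."""
    return [int(r_min <= r <= r_max) for r in success_rates]

# Too easy (0.95) and infeasible (0.0) goals are excluded; 0.4 is kept.
print(label_goals([0.95, 0.4, 0.0]))  # -> [0, 1, 0]
```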
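A hedged sketch of the modified LSGAN losses: the network sizes and the target values `A`, `B`, `C` are assumptions following the standard LSGAN setup, and the key change is the per-goal label `y` that lets the discriminator train on “negative examples”:

```python
import torch
import torch.nn as nn

# LSGAN-style targets (fake, positive, generator); assumed values.
A, B, C = -1.0, 1.0, 0.0

G = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))  # noise -> (x, y) goal
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))  # goal -> score
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)
g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)

def d_loss(goals, y, z):
    """Push D(goal) toward B for GOID goals (y=1), toward A for negative
    examples (y=0) and for generated goals."""
    d_real = D(goals).squeeze(-1)
    d_fake = D(G(z).detach()).squeeze(-1)
    real_term = y * (d_real - B) ** 2 + (1 - y) * (d_real - A) ** 2
    return real_term.mean() + ((d_fake - A) ** 2).mean()

def g_loss(z):
    """Push generated goals toward target C, i.e. toward the region the
    discriminator associates with intermediate difficulty."""
    return ((D(G(z)).squeeze(-1) - C) ** 2).mean()
```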
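And a skeleton of the overall loop, reusing `G`, `D`, `d_loss`, `g_loss`, and `label_goals` from the sketches above; `update_policy` and `success_rates` are stubs for the RL machinery (the paper uses TRPO for the policy update and rollouts for the success estimates):

```python
def update_policy(goals):   # stub: e.g. a few TRPO iterations on these goals
    pass

def success_rates(goals):   # stub: per-goal Monte Carlo success estimates
    return torch.rand(len(goals))

for it in range(5):
    z = torch.randn(64, 4)
    goals = G(z).detach()                          # 1. propose goals with the GAN
    # (the paper also mixes in goals from earlier iterations to avoid forgetting)
    update_policy(goals)                           # 2. train the policy on them
    y = torch.tensor(label_goals(success_rates(goals).tolist()),
                     dtype=torch.float32)          # 3. label intermediate difficulty
    d_opt.zero_grad(); d_loss(goals, y, torch.randn(64, 4)).backward(); d_opt.step()
    g_opt.zero_grad(); g_loss(torch.randn(64, 4)).backward(); g_opt.step()  # 4. retrain GAN
```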
Results
- Tested with Ant on a plane and in a maze, where goals are $(x,y)$ positions
- Improved sample efficiency compared to uniform goal sampling, both with and without an L2 distance reward; slightly better than asymmetric self-play and SAGG-RIAC in the maze
- About 20% of generated goals are of intermediate difficulty, even as the policy improves
- In Free Ant, the generated goals form a ring around the start that expands outward as the policy improves
- Clear benefit of generating appropriate goals when feasible states are a lower-dimensional subset of the full state space
Conclusion
- Training the GAN on all previously sampled goals (not just intermediate ones) actually doesn’t do that badly in the 2D tasks
- No novelty-based baselines are included, though coverage is used as the evaluation metric