State-Dependent Exploration for Policy Gradient Methods

Authors:
Thomas Rückstieß;Martin Felder;Jürgen Schmidhuber
Affiliations:
Technische Universität München, Garching, Germany 85748;Technische Universität München, Garching, Germany 85748;Technische Universität München, Garching, Germany 85748
Venue:
ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II
Year:
2008

Citing 9
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Machine Learning)
Technical Note: Q-Learning (Machine Learning)
Introduction to Reinforcement Learning
PEGASUS: A policy search method for large MDPs and POMDPs (UAI '00 Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence)
Reinforcement Learning in POMDP's via Direct Gradient Ascent (ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning)
Reinforcement learning: a survey (Journal of Artificial Intelligence Research)
Solving deep memory POMDPs with recurrent policy gradients (ICANN'07 Proceedings of the 17th international conference on Artificial neural networks)
Natural actor-critic (ECML'05 Proceedings of the 16th European conference on Machine Learning)
Learning to trade via direct reinforcement (IEEE Transactions on Neural Networks)

Cited 4
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes (Anticipatory Behavior in Adaptive Learning Systems)
A Generalized Path Integral Control Approach to Reinforcement Learning (The Journal of Machine Learning Research)
Reinforcement learning in robotics: A survey (International Journal of Robotics Research)
Policy oscillation is overshooting (Neural Networks)

Abstract

Policy Gradient methods are model-free reinforcement learning algorithms that have in recent years been applied successfully to many real-world problems. Typically, Likelihood Ratio (LR) methods are used to estimate the gradient, but they suffer from high variance because exploration noise is drawn independently at every time step of each training episode. To address this, we introduce a state-dependent exploration (SDE) function that, within an episode, returns the same action for any given state. This results in less variance per episode and faster convergence. SDE also finds solutions overlooked by other methods, and even improves upon state-of-the-art gradient estimators such as Natural Actor-Critic. We systematically derive SDE and apply it to several illustrative toy problems and a challenging robotics simulation task, where SDE greatly outperforms random exploration.
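
The contrast between per-step random exploration and state-dependent exploration can be illustrated with a minimal sketch. The abstract does not specify the form of the exploration function, so the code below assumes a linear controller and a linear exploration function whose parameters are drawn once per episode; the class names, the `sigma` parameter, and the `start_episode` method are hypothetical and chosen only for illustration.

```python
import numpy as np


class RandomExplorationPolicy:
    """Linear controller with fresh Gaussian noise added at every time step."""

    def __init__(self, weights, sigma, rng):
        self.weights = np.asarray(weights, dtype=float)
        self.sigma = sigma
        self.rng = rng

    def start_episode(self):
        pass  # nothing is held fixed across the episode

    def act(self, state):
        state = np.asarray(state, dtype=float)
        # Noise is independent of the state and redrawn on every call.
        return self.weights @ state + self.sigma * self.rng.standard_normal()


class StateDependentExplorationPolicy:
    """Linear controller plus an exploration function fixed for one episode.

    Assumption: the exploration function is linear in the state, with
    parameters sampled once at the start of each episode.
    """

    def __init__(self, weights, sigma, rng):
        self.weights = np.asarray(weights, dtype=float)
        self.sigma = sigma
        self.rng = rng
        self.start_episode()

    def start_episode(self):
        # Draw the exploration parameters once; they stay fixed until the
        # next episode begins.
        self.perturbation = self.sigma * self.rng.standard_normal(
            self.weights.shape
        )

    def act(self, state):
        state = np.asarray(state, dtype=float)
        # Deterministic within an episode: the same state gives the same action.
        return self.weights @ state + self.perturbation @ state


rng = np.random.default_rng(0)
state = np.array([0.5, -1.0])

sde = StateDependentExplorationPolicy([1.0, 2.0], sigma=0.3, rng=rng)
sde.start_episode()
sde_first, sde_second = sde.act(state), sde.act(state)

random_policy = RandomExplorationPolicy([1.0, 2.0], sigma=0.3, rng=rng)
random_first, random_second = random_policy.act(state), random_policy.act(state)

print("SDE repeats its action within an episode:", sde_first == sde_second)
print("Per-step noise repeats its action:", random_first == random_second)
```

Within one episode the state-dependent policy behaves like a deterministic controller with perturbed parameters, which is the property the abstract credits for the lower per-episode variance. Exploration still happens, but across episodes rather than across time steps.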