Solving deep memory POMDPs with recurrent policy gradients

  • Authors:
  • Daan Wierstra; Alexander Foerster; Jan Peters; Jürgen Schmidhuber

  • Affiliations:
  • IDSIA, Manno-Lugano, Switzerland; IDSIA, Manno-Lugano, Switzerland; University of Southern California, Los Angeles, CA; IDSIA, Manno-Lugano, Switzerland

  • Venue:
  • ICANN'07: Proceedings of the 17th International Conference on Artificial Neural Networks
  • Year:
  • 2007

Abstract

This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method that creates limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) requiring long-term memory of past observations. The approach approximates a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities through time. Using a "Long Short-Term Memory" (LSTM) architecture, we are able to outperform other RL methods on two important benchmark tasks. Furthermore, we show promising results on a complex car driving simulation task.
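
To illustrate the flavor of the approach, the sketch below implements a REINFORCE-style recurrent policy gradient in PyTorch on a toy cue-recall POMDP: the per-step characteristic eligibilities (gradients of the log action probabilities) are weighted by the return and backpropagated through an LSTM, so the memory needed to exploit an observation seen many steps earlier is learned from reward alone. The environment, network sizes, and the simple return-to-go weighting are illustrative stand-ins, not the architecture, baselines, or benchmark tasks of the paper.

```python
import torch
import torch.nn as nn

# Toy deep-memory POMDP (illustrative, not from the paper): a cue (0 or 1) is
# shown only at the first step; after `delay` blank steps the agent must pick
# the action matching the cue to receive reward.
class CuePOMDP:
    def __init__(self, delay=10):
        self.delay = delay

    def reset(self):
        self.cue = torch.randint(0, 2, (1,)).item()
        self.t = 0
        return torch.tensor([1.0, float(self.cue)])    # [cue_visible, cue_value]

    def step(self, action):
        self.t += 1
        done = self.t > self.delay
        reward = float(action == self.cue) if done else 0.0
        return torch.tensor([0.0, 0.0]), reward, done   # blank observation

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=2, n_actions=2, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden)       # memory over past observations
        self.head = nn.Linear(hidden, n_actions)   # action logits

    def forward(self, obs, state):
        out, state = self.lstm(obs.view(1, 1, -1), state)
        return self.head(out.view(1, -1)), state

env, policy = CuePOMDP(), RecurrentPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(2000):
    obs, state, done = env.reset(), None, False
    log_probs, rewards = [], []
    while not done:
        logits, state = policy(obs, state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))    # characteristic eligibility
        obs, reward, done = env.step(action.item())
        rewards.append(reward)

    # Weight each step's eligibility by its return-to-go and backpropagate
    # through the LSTM: gradients flow through time, shaping the memory that
    # carries the initial cue to the final decision.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.cat(log_probs) * returns).sum()
    optim.zero_grad()
    loss.backward()     # backpropagation through time over the whole episode
    optim.step()
```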