Policy Gradient Critics

  • Authors:
  • Daan Wierstra; Jürgen Schmidhuber

  • Affiliations:
  • Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), CH-6928 Manno-Lugano, Switzerland; Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), CH-6928 Manno-Lugano, Switzerland and Department of Embedded Systems and Robotics, Technical University Munich, D-85748 Garching, Germany

  • Venue:
  • ECML '07: Proceedings of the 18th European Conference on Machine Learning
  • Year:
  • 2007

Abstract

We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memory of past observations and actions. The approach estimates a policy gradient for an Actor through a Policy Gradient Critic that evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches, which learn purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general-purpose POMDP algorithm that solves RL problems in both continuous and discrete action domains.
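
The sketch below is a minimal, hypothetical illustration of the scheme described in the abstract, not the authors' implementation: a recurrent Actor maps an observation history (its memory state) to a probability distribution over actions, a Policy Gradient Critic scores that whole distribution rather than a single sampled action, and the Actor is improved by following the critic's gradient with respect to the distribution. The use of PyTorch, LSTM memory, discrete actions, and all names (RecurrentActor, PolicyGradientCritic, actor_update, critic_update) are assumptions made for this example.

```python
# Illustrative sketch only; assumes PyTorch, LSTM memory, and discrete actions.
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """Maps an observation history (memory state) to a distribution over actions."""

    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, history):
        # history: (batch, time, obs_dim); the last hidden state summarises the history.
        out, _ = self.rnn(history)
        return torch.softmax(self.head(out[:, -1]), dim=-1)  # action probabilities


class PolicyGradientCritic(nn.Module):
    """Evaluates a (history, action distribution) pair, i.e. scores the whole
    probability distribution on actions instead of one executed action."""

    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.value = nn.Sequential(
            nn.Linear(hidden + n_actions, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, history, action_probs):
        out, _ = self.rnn(history)
        return self.value(torch.cat([out[:, -1], action_probs], dim=-1)).squeeze(-1)


def actor_update(actor, critic, actor_opt, history):
    """Push the actor's distribution in the direction that raises the critic's
    evaluation of it: the gradient flows through the critic into the actor."""
    loss = -critic(history, actor(history)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()


def critic_update(critic, critic_opt, history, behaviour_probs, target_return):
    """Regress the critic toward an observed return for the executed distribution
    (a TD or Monte-Carlo target would supply target_return in practice)."""
    loss = ((critic(history, behaviour_probs) - target_return) ** 2).mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()


if __name__ == "__main__":
    obs_dim, n_actions = 4, 3
    actor = RecurrentActor(obs_dim, n_actions)
    critic = PolicyGradientCritic(obs_dim, n_actions)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    # Dummy batch of length-5 histories and returns, standing in for real rollouts.
    history = torch.randn(8, 5, obs_dim)
    returns = torch.randn(8)

    with torch.no_grad():
        behaviour_probs = actor(history)  # distribution actually executed
    critic_update(critic, critic_opt, history, behaviour_probs, returns)
    actor_update(actor, critic, actor_opt, history)
```

Because the critic takes the full action distribution as input, the actor can be improved by plain backpropagation through the critic, which is one way the same scheme can accommodate both discrete and continuous action parameterisations.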