Experiments with infinite-horizon, policy-gradient estimation

Authors:
Jonathan Baxter; Peter L. Bartlett; Lex Weaver
Affiliations:
WhizBang! Labs, Pittsburgh, PA; BIOwulf Technologies, Berkeley, CA; Department of Computer Science, Australian National University, Canberra, Australia
Venue:
Journal of Artificial Intelligence Research
Year:
2001

Abstract

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. GPOMDP has three chief advantages: it uses only one free parameter, β ∈ [0, 1), which has a natural interpretation in terms of a bias-variance trade-off; it requires no knowledge of the underlying state; and it can be applied to infinite state, control, and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with an algorithm based on conjugate gradients that uses gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
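
As a rough illustration of the kind of estimator the abstract describes, the sketch below keeps an eligibility trace discounted by β and a running average of reward times trace, then feeds the resulting estimate into a plain fixed-step gradient ascent. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy problem (one observation, two controls, reward equal to the control), the one-parameter sigmoid policy, and the names `gpomdp_estimate` and `gradient_ascent` are all hypothetical, and the conjugate-gradient variant with line searches is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gpomdp_estimate(theta, beta, num_steps, rng):
    """Estimate d(average reward)/d(theta) from a single sample path.

    Toy setting (an assumption, not taken from the paper): one observation,
    two controls, P(u = 1) = sigmoid(theta), reward equal to the control.
    """
    z = 0.0      # eligibility trace, discounted by beta in [0, 1)
    delta = 0.0  # running average of reward * trace
    p1 = sigmoid(theta)
    for t in range(num_steps):
        u = 1 if rng.random() < p1 else 0
        # Derivative of log P(u) w.r.t. theta: (1 - p1) if u == 1, else -p1.
        score = (1.0 - p1) if u == 1 else -p1
        reward = float(u)  # reward received after applying control u
        z = beta * z + score
        delta += (reward * z - delta) / (t + 1)
    return delta


def gradient_ascent(theta, beta, num_steps, step_size, num_updates, rng):
    """Fixed-step stochastic gradient ascent driven by the estimates above."""
    for _ in range(num_updates):
        theta += step_size * gpomdp_estimate(theta, beta, num_steps, rng)
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    print("estimate at theta=0:", gpomdp_estimate(0.0, 0.0, 20000, rng))
    print("theta after ascent:", gradient_ascent(0.0, 0.5, 2000, 1.0, 50, rng))
```

In this toy problem the average reward is sigmoid(θ), so the true gradient at θ = 0 is 0.25 and the estimate should land close to it. Because the toy has no hidden state, β has little effect on bias here; in a genuine POMDP, values of β closer to 1 would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract refers to.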