Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, special issue on simulation.
A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research.
How to optimize discrete-event systems from a single sample path by the score function method. Annals of Operations Research.
Linear least-squares algorithms for temporal difference learning. Machine Learning, special issue on reinforcement learning.
Planning and acting in partially observable stochastic domains. Artificial Intelligence.
Gradient descent for general reinforcement learning. Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II.
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Introduction to Reinforcement Learning.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Reinforcement Learning in POMDPs with Function Approximation. ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning.
ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning.
SIAM Journal on Control and Optimization.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research.
Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory.
Geometric variance reduction in Markov chains: application to value function and gradient estimation. The Journal of Machine Learning Research.
A semiparametric statistical approach to model-free policy evaluation. Proceedings of the 25th International Conference on Machine Learning.
The factored policy-gradient planner. Artificial Intelligence.
Learning when to stop thinking and do something! ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning.
Geometric variance reduction in Markov chains: application to value function and gradient estimation. AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence, Volume 2.
A variance analysis for POMDP policy evaluation. AAAI'08: Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 2.
Natural actor-critic algorithms. Automatica (Journal of IFAC).
On-line policy gradient estimation with multi-step sampling. Discrete Event Dynamic Systems.
A convergent online single time scale actor-critic algorithm. The Journal of Machine Learning Research.
Analysis and improvement of policy gradient estimation. Neural Networks.
Analysis of a natural gradient algorithm on monotonic convex-quadratic-composite functions. Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation.
Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation.
Reinforcement learning in robotics: A survey. International Journal of Robotics Research.
Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.
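To make the control-variate view concrete, here is a minimal sketch (in Python with NumPy) comparing baselines for a score-function gradient estimator on a hypothetical two-action bandit; this is not the paper's POMDP/GPOMDP setting, and the policy, rewards, and sample size are illustrative assumptions. The score-weighted formula for the variance-minimizing constant baseline, b* = E[s^2 r] / E[s^2] with s the score, is the standard result behind the abstract's claim about the average-reward baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action bandit with a sigmoid (Bernoulli) policy:
# pi(a=1 | theta) = 1 / (1 + exp(-theta)). All names here are illustrative.
theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))   # probability of choosing action 1
rewards = np.array([0.0, 1.0])      # deterministic reward for actions 0 and 1

n = 200_000
actions = (rng.random(n) < p1).astype(int)
r = rewards[actions]

# Score function d/dtheta log pi(a | theta) = a - p1 for this policy.
score = actions - p1

def grad_samples(b):
    """Score-function gradient samples with an additive baseline b.

    Subtracting b * score is an additive control variate: E[score] = 0,
    so the estimate stays unbiased for any constant b, but its variance
    depends on b.
    """
    return score * (r - b)

b_avg = r.mean()                                   # average-reward baseline
b_opt = (score**2 * r).mean() / (score**2).mean()  # score-weighted optimal baseline

for name, b in [("no baseline", 0.0), ("avg reward", b_avg), ("optimal", b_opt)]:
    g = grad_samples(b)
    print(f"{name:12s}  mean = {g.mean():+.4f}  variance = {g.var():.4f}")
```

In this toy run all three baselines give essentially the same mean, since the zero-expectation score makes the control variate bias-free, but the average-reward baseline leaves more variance than the score-weighted optimum, mirroring the abstract's point that using the average reward as the baseline can be suboptimal.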