A number of reinforcement learning algorithms learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
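To illustrate the idea, the following is a minimal, self-contained Python sketch (not the paper's algorithm) of a REINFORCE-style policy-gradient update with a constant reward baseline. The bandit problem, arm rewards, and step sizes are hypothetical; the point is that subtracting the baseline leaves the expected gradient unchanged while changing its variance, and that tracking the running average reward approximates the variance-minimizing constant baseline described above.

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm rewards
    theta = np.zeros(3)                      # softmax policy parameters
    baseline, alpha = 0.0, 0.1

    for step in range(1, 5001):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(3, p=probs)
        r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm

        # grad of log pi(a) for a softmax policy: e_a - probs
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0

        # Baseline-corrected REINFORCE update: unbiased, typically lower variance.
        theta += alpha * (r - baseline) * grad_log_pi

        # Running estimate of the long-term average reward, used as the baseline.
        baseline += (r - baseline) / step

    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    print("learned policy:", probs)          # should concentrate on the best arm

Removing the baseline update (leaving it at zero) yields the same expected update direction but a noisier estimate, which is the bias/variance behaviour the abstract refers to.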