A number of reinforcement learning algorithms learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
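To illustrate the idea, the following is a minimal, self-contained Python sketch (not the paper's algorithm) of a REINFORCE-style policy-gradient update with a constant reward baseline. The bandit problem, arm rewards, and step sizes are hypothetical; the point is that subtracting the baseline leaves the expected gradient unchanged while changing its variance, and that tracking the running average reward approximates the variance-minimizing constant baseline described above.

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm rewards
    theta = np.zeros(3)                      # softmax policy parameters
    baseline, alpha = 0.0, 0.1

    for step in range(1, 5001):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(3, p=probs)
        r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm

        # grad of log pi(a) for a softmax policy: e_a - probs
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0

        # Baseline-corrected REINFORCE update: unbiased, typically lower variance.
        theta += alpha * (r - baseline) * grad_log_pi

        # Running estimate of the long-term average reward, used as the baseline.
        baseline += (r - baseline) / step

    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    print("learned policy:", probs)          # should concentrate on the best arm

Removing the baseline update (leaving it at zero) yields the same expected update direction but a noisier estimate, which is the bias/variance behaviour the abstract refers to.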