The policy gradient approach is a flexible and powerful reinforcement learning method, particularly for problems with continuous actions such as robot control. A common challenge is reducing the variance of policy gradient estimates so that policy updates are reliable. In this letter, we combine the following three ideas to obtain a highly effective policy gradient method: (1) policy gradients with parameter-based exploration (PGPE), a recently proposed policy-search method whose gradient estimates have low variance; (2) an importance sampling technique, which allows us to reuse previously gathered data in a consistent way; and (3) an optimal baseline, which minimizes the variance of gradient estimates while keeping them unbiased. For the proposed method, we give a theoretical analysis of the variance of gradient estimates and demonstrate its usefulness through extensive experiments.
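To make the combination concrete, the following is a minimal sketch (not the authors' code) of how the three ingredients might fit together: a PGPE-style score-function gradient for the mean of a Gaussian hyper-distribution over policy parameters, importance weights for reusing samples drawn under an earlier hyper-distribution, and a variance-optimal scalar baseline of the general form b = E[R w^2 ||g||^2] / E[w^2 ||g||^2]. All names (pgpe_gradient, rho, sigma, and so on) are illustrative assumptions, not identifiers from the paper.

    import numpy as np

    def pgpe_gradient(thetas, returns, rho, sigma, rho_old, sigma_old):
        # Importance-weighted PGPE gradient estimate for the mean rho of a
        # Gaussian hyper-distribution N(rho, sigma^2) over policy parameters.
        #   thetas  : (N, D) parameters sampled earlier from N(rho_old, sigma_old^2)
        #   returns : (N,)   rollout returns obtained with those parameters
        def log_gauss(theta, m, s):
            return np.sum(-0.5 * ((theta - m) / s) ** 2 - np.log(s), axis=-1)

        # (2) Importance weights: consistently reuse data gathered under the
        # old hyper-distribution.
        w = np.exp(log_gauss(thetas, rho, sigma)
                   - log_gauss(thetas, rho_old, sigma_old))

        # (1) PGPE score function for the mean: grad_rho log N(theta | rho, sigma^2).
        g = (thetas - rho) / sigma ** 2                      # shape (N, D)

        # (3) Variance-optimal scalar baseline, estimated from the sample:
        # b = E[R w^2 ||g||^2] / E[w^2 ||g||^2].
        wg2 = (w ** 2) * np.sum(g ** 2, axis=1)              # shape (N,)
        b = np.sum(returns * wg2) / (np.sum(wg2) + 1e-12)

        # Baseline-corrected, importance-weighted gradient estimate of the
        # expected return with respect to rho.
        return np.mean((w * (returns - b))[:, None] * g, axis=0)

With rho = rho_old and sigma = sigma_old, all weights equal one and this reduces to an on-policy PGPE estimator with an optimal baseline; updating rho by a small step along the returned gradient would be the corresponding policy update.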