Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, special issue on simulation.
A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research.
How to optimize discrete-event systems from a single sample path by the score function method. Annals of Operations Research.
Linear least-squares algorithms for temporal difference learning. Machine Learning, special issue on reinforcement learning.
Planning and acting in partially observable stochastic domains. Artificial Intelligence.
Gradient descent for general reinforcement learning. Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II.
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Introduction to Reinforcement Learning.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Reinforcement Learning in POMDPs with Function Approximation. ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning.
ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning.
SIAM Journal on Control and Optimization.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research.
Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory.
Geometric variance reduction in Markov chains: application to value function and gradient estimation. The Journal of Machine Learning Research.
A semiparametric statistical approach to model-free policy evaluation. Proceedings of the 25th International Conference on Machine Learning.
The factored policy-gradient planner. Artificial Intelligence.
Learning when to stop thinking and do something! ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning.
Geometric variance reduction in Markov chains: application to value function and gradient estimation. AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence, Volume 2.
A variance analysis for POMDP policy evaluation. AAAI'08: Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 2.
Natural actor-critic algorithms. Automatica (Journal of IFAC).
On-line policy gradient estimation with multi-step sampling. Discrete Event Dynamic Systems.
A convergent online single time scale actor-critic algorithm. The Journal of Machine Learning Research.
Analysis and improvement of policy gradient estimation. Neural Networks.
Analysis of a natural gradient algorithm on monotonic convex-quadratic-composite functions. Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation.
Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation.
Reinforcement learning in robotics: A survey. International Journal of Robotics Research.
Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.
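To make the control-variate view concrete, here is a minimal sketch (in Python with NumPy) comparing baselines for a score-function gradient estimator on a hypothetical two-action bandit; this is not the paper's POMDP/GPOMDP setting, and the policy, rewards, and sample size are illustrative assumptions. The score-weighted formula for the variance-minimizing constant baseline, b* = E[s^2 r] / E[s^2] with s the score, is the standard result behind the abstract's claim about the average-reward baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action bandit with a sigmoid (Bernoulli) policy:
# pi(a=1 | theta) = 1 / (1 + exp(-theta)). All names here are illustrative.
theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))   # probability of choosing action 1
rewards = np.array([0.0, 1.0])      # deterministic reward for actions 0 and 1

n = 200_000
actions = (rng.random(n) < p1).astype(int)
r = rewards[actions]

# Score function d/dtheta log pi(a | theta) = a - p1 for this policy.
score = actions - p1

def grad_samples(b):
    """Score-function gradient samples with an additive baseline b.

    Subtracting b * score is an additive control variate: E[score] = 0,
    so the estimate stays unbiased for any constant b, but its variance
    depends on b.
    """
    return score * (r - b)

b_avg = r.mean()                                   # average-reward baseline
b_opt = (score**2 * r).mean() / (score**2).mean()  # score-weighted optimal baseline

for name, b in [("no baseline", 0.0), ("avg reward", b_avg), ("optimal", b_opt)]:
    g = grad_samples(b)
    print(f"{name:12s}  mean = {g.mean():+.4f}  variance = {g.var():.4f}")
```

In this toy run all three baselines give essentially the same mean, since the zero-expectation score makes the control variate bias-free, but the average-reward baseline leaves more variance than the score-weighted optimum, mirroring the abstract's point that using the average reward as the baseline can be suboptimal.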