Policy gradient methods are a useful model-free reinforcement learning approach, but they tend to suffer from unstable gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that, under a mild assumption, the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method. We then derive the optimal baseline for PGPE, which further reduces the variance. We also show theoretically that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
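
To make the abstract's construction concrete, below is a minimal sketch of PGPE with a variance-minimizing scalar baseline of the standard form b* = E[R * ||g||^2] / E[||g||^2], where g is the log-derivative of the hyper-policy; this is the general shape of the optimal baseline the abstract refers to. The toy objective, the Gaussian hyper-policy, and all hyper-parameters here are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)

def toy_return(theta):
    # Toy objective (an assumption for illustration): the noisy "return"
    # of a deterministic policy with parameters theta, peaked at an
    # unknown optimum.
    optimum = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - optimum) ** 2) + rng.normal(scale=0.1)

def pgpe_update(mu, sigma, n=50, lr=0.05):
    # One PGPE step: sample policy parameters theta ~ N(mu, sigma^2) once
    # per rollout, then take the likelihood-ratio gradient with respect to
    # the hyper-parameters (mu, sigma) rather than per-step actions.
    d = mu.size
    thetas = mu + sigma * rng.standard_normal((n, d))
    R = np.array([toy_return(t) for t in thetas])

    # Log-derivatives of the Gaussian hyper-policy.
    g_mu = (thetas - mu) / sigma**2                     # d log p / d mu
    g_sig = ((thetas - mu) ** 2 - sigma**2) / sigma**3  # d log p / d sigma
    g = np.hstack([g_mu, g_sig])                        # shape (n, 2d)

    # Variance-minimizing scalar baseline (sample estimate of
    # b* = E[R * ||g||^2] / E[||g||^2]) instead of the plain mean return.
    g2 = np.sum(g**2, axis=1)
    b = np.sum(R * g2) / np.sum(g2)

    grad = np.mean((R - b)[:, None] * g, axis=0)
    return mu + lr * grad[:d], np.maximum(sigma + lr * grad[d:], 1e-3)

mu, sigma = np.zeros(3), np.ones(3)
for _ in range(300):
    mu, sigma = pgpe_update(mu, sigma)
print(mu)  # should drift toward the toy optimum [1.0, -2.0, 0.5]

Because the policy parameters are drawn only once per rollout, the likelihood-ratio term does not accumulate per-step action noise; this parameter-based exploration is what underlies the variance advantage over REINFORCE that the paper analyzes.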