Policy gradient methods are a useful model-free reinforcement learning approach, but they tend to suffer from unstable gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that, under a mild assumption, the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method. We then derive the optimal baseline for PGPE, which further reduces the variance. We also show theoretically that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
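
To make the abstract's construction concrete, below is a minimal sketch of PGPE with a variance-minimizing scalar baseline of the standard form b* = E[R * ||g||^2] / E[||g||^2], where g is the log-derivative of the hyper-policy; this is the general shape of the optimal baseline the abstract refers to. The toy objective, the Gaussian hyper-policy, and all hyper-parameters here are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)

def toy_return(theta):
    # Toy objective (an assumption for illustration): the noisy "return"
    # of a deterministic policy with parameters theta, peaked at an
    # unknown optimum.
    optimum = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - optimum) ** 2) + rng.normal(scale=0.1)

def pgpe_update(mu, sigma, n=50, lr=0.05):
    # One PGPE step: sample policy parameters theta ~ N(mu, sigma^2) once
    # per rollout, then take the likelihood-ratio gradient with respect to
    # the hyper-parameters (mu, sigma) rather than per-step actions.
    d = mu.size
    thetas = mu + sigma * rng.standard_normal((n, d))
    R = np.array([toy_return(t) for t in thetas])

    # Log-derivatives of the Gaussian hyper-policy.
    g_mu = (thetas - mu) / sigma**2                     # d log p / d mu
    g_sig = ((thetas - mu) ** 2 - sigma**2) / sigma**3  # d log p / d sigma
    g = np.hstack([g_mu, g_sig])                        # shape (n, 2d)

    # Variance-minimizing scalar baseline (sample estimate of
    # b* = E[R * ||g||^2] / E[||g||^2]) instead of the plain mean return.
    g2 = np.sum(g**2, axis=1)
    b = np.sum(R * g2) / np.sum(g2)

    grad = np.mean((R - b)[:, None] * g, axis=0)
    return mu + lr * grad[:d], np.maximum(sigma + lr * grad[d:], 1e-3)

mu, sigma = np.zeros(3), np.ones(3)
for _ in range(300):
    mu, sigma = pgpe_update(mu, sigma)
print(mu)  # should drift toward the toy optimum [1.0, -2.0, 0.5]

Because the policy parameters are drawn only once per rollout, the likelihood-ratio term does not accumulate per-step action noise; this parameter-based exploration is what underlies the variance advantage over REINFORCE that the paper analyzes.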