Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the gradient of the average reward with respect to the policy parameter. That term involves the derivative of the stationary state distribution, which captures the sensitivity of that distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly to 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD), a convenient form of the derivative of the stationary state distribution, through a backward Markov chain formulation and a temporal difference learning framework. We also propose a new policy gradient (PG) framework based on the LSD, in which the average reward gradient can be estimated with γ = 0, so that learning the value functions becomes unnecessary. We test the proposed algorithms on simple benchmark tasks and show that they can improve the performance of existing PG methods.
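To make the neglected term concrete: differentiating the average reward η(θ) = Σ_s d^π(s) Σ_a π(a|s;θ) r(s,a) with the log-derivative trick gives

∇_θ η(θ) = E_{s∼d^π, a∼π}[ r(s,a) ( ∇_θ log d^π(s) + ∇_θ log π(a|s;θ) ) ],

where ∇_θ log d^π is the LSD; this is the γ = 0 form referred to in the abstract, in which no value function appears. The sketch below is a minimal sanity check of this identity, not the authors' algorithm: on a small random tabular MDP it computes the LSD by finite differences (standing in for the paper's backward Markov chain / temporal difference estimator), assembles the gradient from the identity, and compares it against a direct finite-difference gradient of η. All names here (nS, nA, the random kernel P, the softmax parameters theta) are illustrative assumptions.

import numpy as np

# Hypothetical setup: a small random MDP and a tabular softmax policy.
rng = np.random.default_rng(0)
nS, nA = 4, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a, s']
R = rng.random((nS, nA))                       # reward function r(s, a)
theta = rng.normal(size=(nS, nA))              # softmax policy parameters

def pi(th):
    # pi(a | s; theta): softmax over actions in each state
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(th):
    # stationary state distribution d_pi: solve d = P_pi^T d with sum(d) = 1
    Ppi = np.einsum('sa,sat->st', pi(th), P)
    A = np.vstack([Ppi.T - np.eye(nS), np.ones(nS)])
    b = np.zeros(nS + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def eta(th):
    # average reward under the stationary distribution
    return float(stationary(th) @ (pi(th) * R).sum(axis=1))

def grad_of(f, th, eps=1e-6):
    # central finite differences; stands in for the paper's TD-based LSD estimator
    g = np.zeros(th.shape + np.shape(f(th)))
    for idx in np.ndindex(th.shape):
        tp, tm = th.copy(), th.copy()
        tp[idx] += eps; tm[idx] -= eps
        g[idx] = (np.asarray(f(tp)) - np.asarray(f(tm))) / (2 * eps)
    return g

d, p = stationary(theta), pi(theta)
lsd = grad_of(lambda t: np.log(stationary(t)), theta)  # grad_theta log d_pi(s)
glogpi = grad_of(lambda t: np.log(pi(t)), theta)       # grad_theta log pi(a|s)

# Average-reward gradient assembled from the gamma = 0 identity: no value functions.
g = (np.einsum('s,sa,sa,xysa->xy', d, p, R, glogpi)
     + np.einsum('s,sa,sa,xys->xy', d, p, R, lsd))

assert np.allclose(g, grad_of(eta, theta), atol=1e-5)  # matches direct d(eta)/d(theta)
print("LSD-based gradient matches the finite-difference gradient of eta.")

The point of the check is the one made in the abstract: once an estimate of the LSD is available, the average reward gradient can be formed directly from immediate rewards, with no discounted value function (and hence no choice of γ < 1) involved.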