Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the gradient of the average reward with respect to the policy parameter. That term involves the derivative of the stationary state distribution, which captures the sensitivity of that distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly to 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD), a convenient form of the derivative of the stationary state distribution, through a backward Markov chain formulation and a temporal difference learning framework. We also propose a new policy gradient (PG) framework based on the LSD, in which the average reward gradient can be estimated with γ = 0, so that learning the value functions becomes unnecessary. We test the proposed algorithms on simple benchmark tasks and show that they can improve the performance of existing PG methods.
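To make the neglected term concrete: differentiating the average reward η(θ) = Σ_s d^π(s) Σ_a π(a|s;θ) r(s,a) with the log-derivative trick gives

∇_θ η(θ) = E_{s∼d^π, a∼π}[ r(s,a) ( ∇_θ log d^π(s) + ∇_θ log π(a|s;θ) ) ],

where ∇_θ log d^π is the LSD; this is the γ = 0 form referred to in the abstract, in which no value function appears. The sketch below is a minimal sanity check of this identity, not the authors' algorithm: on a small random tabular MDP it computes the LSD by finite differences (standing in for the paper's backward Markov chain / temporal difference estimator), assembles the gradient from the identity, and compares it against a direct finite-difference gradient of η. All names here (nS, nA, the random kernel P, the softmax parameters theta) are illustrative assumptions.

import numpy as np

# Hypothetical setup: a small random MDP and a tabular softmax policy.
rng = np.random.default_rng(0)
nS, nA = 4, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a, s']
R = rng.random((nS, nA))                       # reward function r(s, a)
theta = rng.normal(size=(nS, nA))              # softmax policy parameters

def pi(th):
    # pi(a | s; theta): softmax over actions in each state
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(th):
    # stationary state distribution d_pi: solve d = P_pi^T d with sum(d) = 1
    Ppi = np.einsum('sa,sat->st', pi(th), P)
    A = np.vstack([Ppi.T - np.eye(nS), np.ones(nS)])
    b = np.zeros(nS + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def eta(th):
    # average reward under the stationary distribution
    return float(stationary(th) @ (pi(th) * R).sum(axis=1))

def grad_of(f, th, eps=1e-6):
    # central finite differences; stands in for the paper's TD-based LSD estimator
    g = np.zeros(th.shape + np.shape(f(th)))
    for idx in np.ndindex(th.shape):
        tp, tm = th.copy(), th.copy()
        tp[idx] += eps; tm[idx] -= eps
        g[idx] = (np.asarray(f(tp)) - np.asarray(f(tm))) / (2 * eps)
    return g

d, p = stationary(theta), pi(theta)
lsd = grad_of(lambda t: np.log(stationary(t)), theta)  # grad_theta log d_pi(s)
glogpi = grad_of(lambda t: np.log(pi(t)), theta)       # grad_theta log pi(a|s)

# Average-reward gradient assembled from the gamma = 0 identity: no value functions.
g = (np.einsum('s,sa,sa,xysa->xy', d, p, R, glogpi)
     + np.einsum('s,sa,sa,xys->xy', d, p, R, lsd))

assert np.allclose(g, grad_of(eta, theta), atol=1e-5)  # matches direct d(eta)/d(theta)
print("LSD-based gradient matches the finite-difference gradient of eta.")

The point of the check is the one made in the abstract: once an estimate of the LSD is available, the average reward gradient can be formed directly from immediate rewards, with no discounted value function (and hence no choice of γ < 1) involved.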