Natural actor-critic algorithms

Authors:
Shalabh Bhatnagar;Richard S. Sutton;Mohammad Ghavamzadeh;Mark Lee
Affiliations:
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India;The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8;INRIA Lille - Nord Europe, Team SequeL, France;The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
Venue:
Automatica (Journal of IFAC)
Year:
2009

Abstract

We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
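The actor-critic structure described in the abstract can be illustrated with a minimal sketch. It is not one of the paper's four algorithms: the two-state MDP, the tabular features, the softmax policy, the step-size schedules, and all names (`step`, `actor_critic`, `theta`, `v`) are assumptions made for illustration. It uses a regular (not natural) policy gradient in an average-reward setting, with the critic's temporal difference error driving the actor update and the actor running on a slower step-size schedule than the critic.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def step(state, action, rng):
    """Hypothetical toy MDP: reward 1 when the action matches the state;
    the next state equals the chosen action with probability 0.8."""
    reward = 1.0 if action == state else 0.0
    next_state = action if rng.random() < 0.8 else 1 - action
    return reward, next_state


def actor_critic(num_steps=50_000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: softmax preferences per state
    v = [0.0, 0.0]                    # critic: tabular differential values
    avg_reward = 0.0                  # running estimate of the average reward
    state = 0
    for t in range(num_steps):
        # Two timescales (assumed schedules): beta / alpha -> 0 as t grows,
        # so the critic adapts faster than the actor.
        alpha = 0.1 / (1.0 + t / 1000.0) ** 0.6
        beta = 0.05 / (1.0 + t / 1000.0) ** 0.9

        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        reward, next_state = step(state, action, rng)

        # Critic: average-reward temporal difference error and TD(0) update.
        avg_reward += alpha * (reward - avg_reward)
        delta = reward - avg_reward + v[next_state] - v[state]
        v[state] += alpha * delta

        # Actor: stochastic gradient step along delta * grad log pi(a | s).
        for b in range(2):
            psi = (1.0 if b == action else 0.0) - probs[b]
            theta[state][b] += beta * delta * psi

        state = next_state

    policy = [softmax(theta[s]) for s in range(2)]
    return policy, avg_reward, v


policy, avg_reward, values = actor_critic()
print("policy:", policy)
print("average reward estimate:", avg_reward)
```

In this toy problem the best policy picks the action that matches the current state, so the learned probabilities `policy[0][0]` and `policy[1][1]` should both move toward 1. A natural-gradient variant would additionally precondition the actor step with an estimate of the inverse Fisher information matrix, which this sketch omits.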