We present a new actor-critic learning model in which the critic is drawn from a Bayesian class of non-parametric critics based on Gaussian process temporal difference (GPTD) learning. Such critics model the state-action value function as a Gaussian process, allowing Bayes' rule to be used to compute the posterior distribution over state-action value functions, conditioned on the observed data. Appropriate choices of the prior covariance (kernel) between state-action values and of the parametrization of the policy yield closed-form expressions for the posterior distribution of the gradient of the average discounted return with respect to the policy parameters. The posterior mean, which serves as our estimate of the policy gradient, is used to update the policy, while the posterior covariance allows us to gauge the reliability of that update.
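As a brief sketch of why a Gaussian posterior over the gradient arises (our notation, not fixed by the abstract: \(\eta(\theta)\) denotes the average discounted return and \(\mu^{\pi}\) the discounted state-action occupancy measure), the policy gradient theorem gives

\[
\nabla_{\theta}\,\eta(\theta) \;=\; \int \mu^{\pi}(\mathrm{d}s\,\mathrm{d}a)\; \nabla_{\theta} \log \pi(a \mid s;\theta)\; Q^{\pi}(s,a).
\]

The right-hand side is a linear functional of \(Q^{\pi}\), so a Gaussian process posterior over \(Q^{\pi}\) (here obtained via GPTD) induces a Gaussian posterior over \(\nabla_{\theta}\,\eta\):

\[
\nabla_{\theta}\,\eta \mid \mathcal{D} \;\sim\; \mathcal{N}\big(\mathbb{E}[\nabla_{\theta}\,\eta \mid \mathcal{D}],\; \mathrm{Cov}[\nabla_{\theta}\,\eta \mid \mathcal{D}]\big),
\]

with mean and covariance obtained by applying the same linear operator to the GPTD posterior mean and covariance of \(Q^{\pi}\). When the prior kernel over state-action values is chosen to match the policy parametrization, as the abstract indicates, these integrals reduce to the closed-form expressions used in the update.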