In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy at each update. This allows us to prove finite-iteration and asymptotic l∞-norm performance-loss bounds in the presence of approximation/estimation error that depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependence on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than their supremum. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we numerically illustrate the applicability of these results on benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
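To make the "gradual change in the policy" idea concrete, the sketch below implements an incremental, softmax-weighted value-iteration update on a tabular MDP: instead of overwriting action values at each sweep, it adds the Bellman residual to a table of action preferences, so the induced (softmax) policy changes only gradually. This is a minimal illustration in the spirit of DPP, not the paper's exact operator; the function name `dpp_sketch`, the temperature `eta`, and the specific update rule are assumptions for the example.

```python
import numpy as np

def dpp_sketch(P, R, gamma=0.9, eta=5.0, iters=500):
    """Illustrative tabular sketch of a DPP-style incremental update.

    P: (S, A, S) transition probabilities; R: (S, A) rewards.
    Psi holds action preferences. Each sweep *adds* the
    softmax-weighted Bellman residual to Psi rather than replacing
    it, which makes the induced policy change gradually.
    """
    S, A = R.shape
    Psi = np.zeros((S, A))
    for _ in range(iters):
        # softmax weights over actions (max-shifted for stability)
        w = np.exp(eta * (Psi - Psi.max(axis=1, keepdims=True)))
        w /= w.sum(axis=1, keepdims=True)
        v = (w * Psi).sum(axis=1)            # soft state value
        # incremental update: add the residual instead of overwriting
        Psi = Psi + R + gamma * (P @ v) - v[:, None]
    # greedy policy from the final preferences
    return Psi.argmax(axis=1), Psi
```

On a small MDP, the preferences of suboptimal actions drift downward while the optimal action's preference settles near its optimal value, so the greedy policy extracted at the end is optimal; a sampling-based variant in the spirit of DPP-RL would replace the exact expectation `P @ v` with single-sample estimates.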