Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Authors:
Mohammad Gheshlaghi Azar;Rémi Munos;Hilbert J. Kappen
Affiliations:
Department of Biophysics, Radboud University Nijmegen, EZ Nijmegen, The Netherlands 6525 and School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213;INRIA Lille, SequeL Project, Villeneuve d'Ascq, France 59650;Department of Biophysics, Radboud University Nijmegen, EZ Nijmegen, The Netherlands 6525
Venue:
Machine Learning
Year:
2013

Citing 15
Cited 0

A guided tour of Chernoff bounds

Information Processing Letters
An Upper Bound on the Loss from Approximate Optimal-Value Functions

Machine Learning
Finite-sample convergence rates for Q-learning and indirect algorithms

Proceedings of the 1998 conference on Advances in neural information processing systems II
Markov Decision Processes: Discrete Stochastic Dynamic Programming

Markov Decision Processes: Discrete Stochastic Dynamic Programming
Introduction to Reinforcement Learning

Introduction to Reinforcement Learning
Neuro-Dynamic Programming

Neuro-Dynamic Programming
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

The Journal of Machine Learning Research
Prediction, Learning, and Games

Prediction, Learning, and Games
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

The Journal of Machine Learning Research
Dynamic Programming and Optimal Control, Vol. II

Dynamic Programming and Optimal Control, Vol. II
Reinforcement Learning in Finite MDPs: PAC Analysis

The Journal of Machine Learning Research
REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs

UAI '09 Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence
Algorithms for Reinforcement Learning

Algorithms for Reinforcement Learning
Near-optimal Regret Bounds for Reinforcement Learning

The Journal of Machine Learning Research
PAC bounds for discounted MDPs

ALT'12 Proceedings of the 23rd international conference on Algorithmic Learning Theory

Quantified Score

Hi-index	0.00

Visualization

Abstract

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample-complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and the discount factor 驴驴[0,1) only O(Nlog(N/驴)/((1驴驴)3 驴 2)) state-transition samples are required to find an 驴-optimal estimation of the action-value function with the probability (w.p.) 1驴驴. Further, we prove that, for small values of 驴, an order of O(Nlog(N/驴)/((1驴驴)3 驴 2)) samples is required to find an 驴-optimal policy w.p. 1驴驴. We also prove a matching lower bound of 驴(Nlog(N/驴)/((1驴驴)3 驴 2)) on the sample complexity of estimating the optimal action-value function with 驴 accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, 驴, 驴 and 1/(1驴驴) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1驴驴).