Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

  • Authors:
  • Mohammad Gheshlaghi Azar, Rémi Munos, Hilbert J. Kappen

  • Affiliations:
  • Mohammad Gheshlaghi Azar: Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands, and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
  • Rémi Munos: INRIA Lille, SequeL Project, 59650 Villeneuve d'Ascq, France
  • Hilbert J. Kappen: Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

  • Venue:
  • Machine Learning
  • Year:
  • 2013


Abstract

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0,1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
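
The abstract refers to model-based value iteration run on an empirical MDP built from generative-model samples. Below is a minimal Python sketch of that scheme, not the authors' code: the callable sample_next_state, the reward array, the constant c in the per-pair sample count, and the iteration count are illustrative assumptions; the sample size merely mirrors the form of the O(N log(N/δ)/((1−γ)³ε²)) upper bound quoted above.

```python
import numpy as np

def model_based_q_iteration(sample_next_state, reward, n_states, n_actions,
                            gamma, epsilon, delta, c=1.0):
    """Sketch of model-based Q-value iteration with a generative model.

    sample_next_state(s, a) is an assumed callable returning one next-state
    index drawn from P(.|s, a); reward is an (n_states, n_actions) array.
    The constant c is a hypothetical stand-in for the unspecified constant
    in the sample-complexity bound.
    """
    N = n_states * n_actions
    # Samples per state-action pair, mirroring log(N/delta)/((1-gamma)^3 eps^2).
    m = int(np.ceil(c * np.log(N / delta) / ((1.0 - gamma) ** 3 * epsilon ** 2)))

    # Build the empirical transition model from generative-model samples.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                s_next = sample_next_state(s, a)  # one generative-model call
                P_hat[s, a, s_next] += 1.0 / m

    # Run value iteration on the empirical MDP for enough iterations that the
    # remaining contraction error is of order epsilon (an illustrative choice).
    n_iter = max(1, int(np.ceil(np.log(1.0 / (epsilon * (1.0 - gamma)))
                                / (1.0 - gamma))))
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)               # greedy value of the current Q
        Q = reward + gamma * (P_hat @ V)  # empirical Bellman optimality update
    return Q
```

The returned Q is the estimate of the optimal action-value function computed from the empirical model; the greedy policy with respect to it is the candidate ε-optimal policy the bounds speak about.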