Near-Optimal Reinforcement Learning in Polynomial Time

  • Authors:
  • Michael Kearns; Satinder Singh

  • Affiliations:
  • Michael Kearns: Department of Computer and Information Science, University of Pennsylvania, Moore School Building, 200 South 33rd Street, Philadelphia, PA 19104-6389, USA. mkearns@cis.upenn.edu
  • Satinder Singh: Syntek Capital, New York, NY 10019, USA. satinder.baveja@syntekcapital.com

  • Venue:
  • Machine Learning
  • Year:
  • 2002

Abstract

We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
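
The last point of the abstract, the explicit handling of the exploration-exploitation trade-off, is the heart of the algorithm the paper introduces (commonly known as E^3, "Explicit Explore or Exploit"). The sketch below is only a schematic in that spirit, not the authors' construction: it assumes a small tabular MDP with hand-made empirical counts, an illustrative "known state" threshold M_KNOWN, a planning horizon HORIZON standing in for the T of the abstract, and an arbitrary exploit threshold. It contrasts planning in an empirical known-state model against planning in an exploration model whose only reward lies in reaching insufficiently visited states.

    # explore_or_exploit_sketch.py -- a minimal, illustrative sketch of an explicit
    # explore-or-exploit decision over a "known-state" model.  The toy counts, the
    # M_KNOWN threshold, the planning horizon, and the exploit threshold are all
    # assumptions made for this example, not values from the paper.

    N_STATES, N_ACTIONS = 3, 2   # states 0 and 1 are well explored; state 2 is not
    HORIZON = 20                 # planning horizon (the T of the abstract)
    M_KNOWN = 100                # visits per (state, action) before a state is "known"

    # Empirical experience gathered so far: counts[s][a][s'] and reward_sum[s][a].
    counts = [[[80, 15, 5], [10, 85, 5]],
              [[60, 30, 10], [20, 70, 10]],
              [[5, 5, 2], [3, 4, 1]]]
    reward_sum = [[90.0, 20.0], [50.0, 60.0], [6.0, 2.0]]

    def known_states():
        """States whose every action has been tried at least M_KNOWN times."""
        return {s for s in range(N_STATES)
                if all(sum(counts[s][a]) >= M_KNOWN for a in range(N_ACTIONS))}

    def build_model(explore):
        """Known-state MDP: unknown states are absorbing.  In exploit mode they pay
        0 while known states pay their empirical mean reward; in explore mode the
        rewards are flipped, so the optimal policy escapes to unknown states fast."""
        known = known_states()
        P = [[None] * N_ACTIONS for _ in range(N_STATES)]
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                if s not in known:
                    P[s][a] = [(1.0, s, 1.0 if explore else 0.0)]
                    continue
                n = sum(counts[s][a])
                r = 0.0 if explore else reward_sum[s][a] / n
                P[s][a] = [(counts[s][a][s2] / n, s2, r)
                           for s2 in range(N_STATES) if counts[s][a][s2] > 0]
        return P

    def plan(P):
        """Finite-horizon value iteration on a tabular model P[s][a] = [(p, s', r)]."""
        V, pi = [0.0] * N_STATES, [0] * N_STATES
        for _ in range(HORIZON):
            V, old = [0.0] * N_STATES, V
            for s in range(N_STATES):
                qs = [sum(p * (r + old[s2]) for p, s2, r in P[s][a])
                      for a in range(N_ACTIONS)]
                V[s], pi[s] = max(qs), qs.index(max(qs))
        return V, pi

    def choose_action(s, exploit_threshold=0.5 * HORIZON):
        if s not in known_states():
            # Balanced wandering: in an unknown state, try the least-tried action.
            return min(range(N_ACTIONS), key=lambda a: sum(counts[s][a]))
        v_exploit, pi_exploit = plan(build_model(explore=False))
        v_explore, pi_explore = plan(build_model(explore=True))
        # The explicit trade-off: exploit if the known-state model already promises
        # enough return over the horizon, otherwise follow the escaping policy.
        return pi_exploit[s] if v_exploit[s] >= exploit_threshold else pi_explore[s]

    print(choose_action(0))   # plans in both models and picks the exploit action here

The structure of choose_action is the point of the sketch: either the empirical known-state model already supports sufficiently high return over the horizon, or planning in the exploration model yields a policy that reaches a poorly explored state quickly. The paper's contribution is to make this dichotomy precise and to show that it yields near-optimal return using only a number of actions and computation time polynomial in T and in the numbers of states and actions.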