PAC Bounds for Multi-armed Bandit and Markov Decision Processes

  • Authors:
  • Eyal Even-Dar; Shie Mannor; Yishay Mansour

  • Venue:
  • COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory
  • Year:
  • 2002

Abstract

The bandit problem is revisited and considered under the PAC model. Our main contribution here is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We then show how, given an algorithm for the PAC-model multi-armed bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
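
The published version of this work calls its main bandit algorithm Median Elimination. The Python sketch below illustrates the technique under simplifying assumptions, namely that each arm is a zero-argument callable returning a reward in [0, 1]; it is an illustrative reconstruction, not the authors' exact pseudocode.

```python
import math

def median_elimination(arms, epsilon, delta):
    """Return the index of an arm whose mean reward is within
    epsilon of the best arm's, with probability at least 1 - delta.

    Assumes each arm is a zero-argument callable returning a reward
    in [0, 1]. Total pulls: O((n / epsilon^2) * log(1 / delta)).
    """
    survivors = list(range(len(arms)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(survivors) > 1:
        # Hoeffding: this many pulls estimate each surviving mean
        # to within eps_l / 2, except with probability delta_l.
        pulls = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(arms[a]() for _ in range(pulls)) / pulls
                 for a in survivors}
        # Keep the empirically better half of the arms.
        survivors.sort(key=means.__getitem__, reverse=True)
        survivors = survivors[:math.ceil(len(survivors) / 2)]
        # Tighten accuracy and confidence geometrically so the total
        # error stays under epsilon and total failures under delta.
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]
```

The geometric schedules (eps_l shrinking by 3/4, delta_l halving) are what make the per-round sample sizes form a convergent series, so the total number of pulls is O((n/ε²) log(1/δ)) rather than paying a log(n/δ) union-bound factor for every arm, as the naive approach does.

The reduction to MDPs described in the abstract can be sketched the same way: simulate Value Iteration, and at each state treat the actions as bandit arms whose pulls sample the one-step backup. In the sketch below, the generative-model interface `sample_model(s, a)` and the fixed `eval_pulls` constant are hypothetical choices for illustration; the paper derives the required sample sizes from ε and δ.

```python
def phased_value_iteration(sample_model, states, actions, gamma,
                           num_phases, epsilon, delta, eval_pulls=200):
    """Batch MDP learning by simulating Value Iteration and invoking
    the PAC bandit routine at every state (the abstract's reduction).

    `sample_model(s, a)` is an assumed generative model returning a
    (reward, next_state) pair; `eval_pulls` is an illustrative
    constant, not the sample size the paper derives.
    """
    V = {s: 0.0 for s in states}
    for _ in range(num_phases):
        V_new = {}
        for s in states:
            # Each action at state s is one "arm"; a pull draws the
            # one-step backup r + gamma * V(s'). (Backups may leave
            # [0, 1]; a careful version rescales them first.)
            def make_arm(a, s=s):
                def pull():
                    r, s_next = sample_model(s, a)
                    return r + gamma * V[s_next]
                return pull
            arms = [make_arm(a) for a in actions]
            best = median_elimination(arms, epsilon, delta)
            # Average extra pulls of the chosen arm as the new value.
            V_new[s] = sum(arms[best]()
                           for _ in range(eval_pulls)) / eval_pulls
        V = V_new
    return V
```

Because the bandit subroutine only needs to identify a near-optimal action rather than estimate every action's value to full precision, the dependence on the number of actions improves, which is the point of the reduction.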