V-MAX: tempered optimism for better PAC reinforcement learning

  • Authors:
  • Karun Rao; Shimon Whiteson

  • Affiliations:
  • University of Amsterdam, Amsterdam, The Netherlands; University of Amsterdam, Amsterdam, The Netherlands

  • Venue:
  • Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
  • Year:
  • 2012

Abstract

Recent advances in reinforcement learning have yielded several PAC-MDP algorithms that, using the principle of optimism in the face of uncertainty, are guaranteed to act near-optimally with high probability on all but a polynomial number of samples. Unfortunately, many of these algorithms, such as R-MAX, perform poorly in practice because their initial exploration in each state, before the associated model parameters have been learned with confidence, is random. Others, such as Model-Based Interval Estimation (MBIE), have weaker sample complexity bounds and require careful parameter tuning. This paper proposes a new PAC-MDP algorithm called V-MAX designed to address these problems. By restricting its optimism to future visits, V-MAX can exploit its experience early in learning and thus obtain more cumulative reward than R-MAX. Furthermore, doing so does not compromise the quality of exploration, as we prove bounds on the sample complexity of V-MAX that are identical to those of R-MAX. Finally, we present empirical results in two domains demonstrating that V-MAX can substantially outperform R-MAX and match or outperform MBIE while being easier to tune, as its performance is invariant to conservative choices of its primary parameter.
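
For context, the R-MAX-style optimism the abstract refers to can be illustrated with a short sketch. The code below is not the authors' V-MAX algorithm (which tempers this optimism by restricting it to future visits); it is only a minimal illustration, under assumed names and a tabular-MDP setting, of the baseline rule that any state-action pair visited fewer than m times is planned for as if it yielded the maximal value V_max. The arrays counts, rewards, and transitions are hypothetical empirical-model accumulators, and the reward bound r_max = 1 is an assumption.

```python
import numpy as np

def rmax_value_iteration(n_states, n_actions, counts, rewards, transitions,
                         gamma=0.95, m=5, tol=1e-6):
    """Optimistic value iteration in the R-MAX style (illustrative sketch).

    counts[s, a]        -- number of times (s, a) has been tried
    rewards[s, a]       -- summed reward observed at (s, a)
    transitions[s, a]   -- vector of next-state counts from (s, a)
    """
    r_max = 1.0                    # assumed upper bound on per-step reward
    v_max = r_max / (1.0 - gamma)  # optimistic value assigned to unknown pairs
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)
        Q_new = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                if counts[s, a] < m:
                    # Unknown pair: full optimism, encouraging exploration.
                    Q_new[s, a] = v_max
                else:
                    # Known pair: plan with the empirical model.
                    p = transitions[s, a] / counts[s, a]
                    r = rewards[s, a] / counts[s, a]
                    Q_new[s, a] = r + gamma * (p @ V)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

Acting greedily with respect to this optimistic Q drives the agent toward under-visited state-action pairs; the drawback noted in the abstract is that behavior within an unknown state is effectively random until the visit threshold m is reached, which is the inefficiency V-MAX is designed to avoid.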