Technical Note: Q-Learning. Machine Learning.
Introduction to Reinforcement Learning.
Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning.
Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes.
R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. The Journal of Machine Learning Research.
An Empirical Evaluation of Interval Estimation for Markov Decision Processes. ICTAI '04: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence.
A Theoretical Analysis of Model-Based Interval Estimation. ICML '05: Proceedings of the 22nd International Conference on Machine Learning.
Hierarchical Model-Based Reinforcement Learning: R-max + MAXQ. Proceedings of the 25th International Conference on Machine Learning.
The Many Faces of Optimism: A Unifying Approach. Proceedings of the 25th International Conference on Machine Learning.
An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences.
Efficient Reinforcement Learning with Relocatable Action Models. AAAI '07: Proceedings of the 22nd National Conference on Artificial Intelligence, Volume 1.
Reinforcement Learning in Finite MDPs: PAC Analysis. The Journal of Machine Learning Research.
Model-Based Exploration in Continuous State Spaces. SARA '07: Proceedings of the 7th International Conference on Abstraction, Reformulation, and Approximation.
Near-Optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research.
Knows What It Knows: A Framework for Self-Aware Learning. Machine Learning.
V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 3.
Recent advances in reinforcement learning have yielded several PAC-MDP algorithms that, using the principle of optimism in the face of uncertainty, are guaranteed to act near-optimally with high probability on all but a polynomial number of samples. Unfortunately, many of these algorithms, such as R-MAX, perform poorly in practice because their initial exploration in each state, before the associated model parameters have been learned with confidence, is random. Others, such as Model-Based Interval Estimation (MBIE), have weaker sample-complexity bounds and require careful parameter tuning. This paper proposes a new PAC-MDP algorithm called V-MAX, designed to address these problems. By restricting its optimism to future visits, V-MAX can exploit its experience early in learning and thus obtain more cumulative reward than R-MAX. Furthermore, doing so does not compromise the quality of exploration, as we prove bounds on the sample complexity of V-MAX that are identical to those of R-MAX. Finally, we present empirical results in two domains demonstrating that V-MAX can substantially outperform R-MAX and match or outperform MBIE while being easier to tune, since its performance is invariant to conservative choices of its primary parameter.
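To make the contrast concrete, the following is a minimal Python sketch of the two Bellman backups, based only on the abstract's description. The blending rule in vmax_q is one plausible reading of "optimism restricted to future visits", not necessarily the paper's exact update; the names m (known-ness threshold), gamma, r_max, r_hat, and p_hat follow standard PAC-MDP conventions rather than the paper's notation.

```python
# Hedged sketch: R-MAX backup vs. a V-MAX-style backup for one (s, a) pair.
# Assumptions (not from the paper's text): the V-MAX-style rule treats only
# the (m - n) not-yet-observed visits optimistically and weights the rest
# by the empirical backup.
import numpy as np

def rmax_q(n, r_hat, p_hat, V, m, gamma, r_max):
    """R-MAX backup.

    n      -- number of samples observed for (s, a)
    r_hat  -- empirical mean reward (undefined if n == 0)
    p_hat  -- empirical next-state distribution, shape (|S|,)
    V      -- current state-value estimates, shape (|S|,)
    """
    v_max = r_max / (1.0 - gamma)
    if n < m:
        # Unknown pair: ignore all experience so far and assume maximal
        # value, which drives exploration but wastes early data.
        return v_max
    return r_hat + gamma * p_hat @ V  # empirical Bellman backup

def vmax_q(n, r_hat, p_hat, V, m, gamma, r_max):
    """V-MAX-style backup: only the (m - n) future visits are optimistic;
    the n observed samples contribute their empirical backup, so partial
    experience is exploited immediately."""
    v_max = r_max / (1.0 - gamma)
    n = min(n, m)
    if n == 0:
        return v_max  # no data yet: coincides with R-MAX
    empirical = r_hat + gamma * p_hat @ V
    return (n * empirical + (m - n) * v_max) / m

if __name__ == "__main__":
    # Toy two-state example: watch the V-MAX-style value interpolate from
    # fully optimistic (n = 0) down to the empirical backup (n = m).
    V = np.array([0.0, 5.0])
    p_hat = np.array([0.3, 0.7])
    for n in range(6):
        print(n,
              rmax_q(n, 1.0, p_hat, V, m=5, gamma=0.9, r_max=1.0),
              vmax_q(n, 1.0, p_hat, V, m=5, gamma=0.9, r_max=1.0))
```

Under this reading, the V-MAX-style backup coincides with R-MAX's fully optimistic value when no samples have been observed and converges to the empirical backup as n approaches m, which is consistent with the abstract's claims that V-MAX exploits experience earlier than R-MAX while retaining identical sample-complexity bounds.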