Technical Note: Q-Learning. Machine Learning.
Introduction to Reinforcement Learning.
Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning.
Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes.
R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. The Journal of Machine Learning Research.
An Empirical Evaluation of Interval Estimation for Markov Decision Processes. ICTAI '04: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence.
A Theoretical Analysis of Model-Based Interval Estimation. ICML '05: Proceedings of the 22nd International Conference on Machine Learning.
Hierarchical Model-Based Reinforcement Learning: R-max + MAXQ. Proceedings of the 25th International Conference on Machine Learning.
The Many Faces of Optimism: A Unifying Approach. Proceedings of the 25th International Conference on Machine Learning.
An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences.
Efficient Reinforcement Learning with Relocatable Action Models. AAAI '07: Proceedings of the 22nd National Conference on Artificial Intelligence, Volume 1.
Reinforcement Learning in Finite MDPs: PAC Analysis. The Journal of Machine Learning Research.
Model-Based Exploration in Continuous State Spaces. SARA '07: Proceedings of the 7th International Conference on Abstraction, Reformulation, and Approximation.
Near-Optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research.
Knows What It Knows: A Framework for Self-Aware Learning. Machine Learning.
V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 3.
Recent advances in reinforcement learning have yielded several PAC-MDP algorithms that, using the principle of optimism in the face of uncertainty, are guaranteed to act near-optimally with high probability on all but a polynomial number of samples. Unfortunately, many of these algorithms, such as R-MAX, perform poorly in practice because their initial exploration in each state, before the associated model parameters have been learned with confidence, is random. Others, such as Model-Based Interval Estimation (MBIE), have weaker sample-complexity bounds and require careful parameter tuning. This paper proposes a new PAC-MDP algorithm called V-MAX, designed to address these problems. By restricting its optimism to future visits, V-MAX can exploit its experience early in learning and thus obtain more cumulative reward than R-MAX. Furthermore, doing so does not compromise the quality of exploration, as we prove bounds on the sample complexity of V-MAX that are identical to those of R-MAX. Finally, we present empirical results in two domains demonstrating that V-MAX can substantially outperform R-MAX and match or outperform MBIE while being easier to tune, since its performance is invariant to conservative choices of its primary parameter.
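To make the contrast concrete, the following is a minimal Python sketch of the two Bellman backups, based only on the abstract's description. The blending rule in vmax_q is one plausible reading of "optimism restricted to future visits", not necessarily the paper's exact update; the names m (known-ness threshold), gamma, r_max, r_hat, and p_hat follow standard PAC-MDP conventions rather than the paper's notation.

```python
# Hedged sketch: R-MAX backup vs. a V-MAX-style backup for one (s, a) pair.
# Assumptions (not from the paper's text): the V-MAX-style rule treats only
# the (m - n) not-yet-observed visits optimistically and weights the rest
# by the empirical backup.
import numpy as np

def rmax_q(n, r_hat, p_hat, V, m, gamma, r_max):
    """R-MAX backup.

    n      -- number of samples observed for (s, a)
    r_hat  -- empirical mean reward (undefined if n == 0)
    p_hat  -- empirical next-state distribution, shape (|S|,)
    V      -- current state-value estimates, shape (|S|,)
    """
    v_max = r_max / (1.0 - gamma)
    if n < m:
        # Unknown pair: ignore all experience so far and assume maximal
        # value, which drives exploration but wastes early data.
        return v_max
    return r_hat + gamma * p_hat @ V  # empirical Bellman backup

def vmax_q(n, r_hat, p_hat, V, m, gamma, r_max):
    """V-MAX-style backup: only the (m - n) future visits are optimistic;
    the n observed samples contribute their empirical backup, so partial
    experience is exploited immediately."""
    v_max = r_max / (1.0 - gamma)
    n = min(n, m)
    if n == 0:
        return v_max  # no data yet: coincides with R-MAX
    empirical = r_hat + gamma * p_hat @ V
    return (n * empirical + (m - n) * v_max) / m

if __name__ == "__main__":
    # Toy two-state example: watch the V-MAX-style value interpolate from
    # fully optimistic (n = 0) down to the empirical backup (n = m).
    V = np.array([0.0, 5.0])
    p_hat = np.array([0.3, 0.7])
    for n in range(6):
        print(n,
              rmax_q(n, 1.0, p_hat, V, m=5, gamma=0.9, r_max=1.0),
              vmax_q(n, 1.0, p_hat, V, m=5, gamma=0.9, r_max=1.0))
```

Under this reading, the V-MAX-style backup coincides with R-MAX's fully optimistic value when no samples have been observed and converges to the empirical backup as n approaches m, which is consistent with the abstract's claims that V-MAX exploits experience earlier than R-MAX while retaining identical sample-complexity bounds.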