We study upper and lower bounds on the sample complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors provided the transition matrix is not too dense.
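The abstract's quantitative claims can be collected into one formula. Below is a sketch of the shape of the upper bound as described here, assuming accuracy epsilon, confidence delta, and discount factor gamma (so the effective horizon is 1/(1-gamma)), and writing T for the number of non-zero transition probabilities; the epsilon and delta dependence shown is the usual PAC form, and the precise constants and logarithmic factors are stated in the paper itself.

% Shape of the upper bound implied by the abstract: linear in the
% number T of non-zero transition probabilities, cubic in the
% effective horizon 1/(1-\gamma); \tilde{O} hides logarithmic factors.
\[
  N(\epsilon,\delta)
  \;=\;
  \tilde{O}\!\left(
    \frac{T}{\epsilon^{2}\,(1-\gamma)^{3}}
    \log\frac{1}{\delta}
  \right),
\]
% where N(\epsilon,\delta) bounds the number of time steps at which the
% algorithm's policy is more than \epsilon sub-optimal.

For readers unfamiliar with UCRL, the sketch below shows the generic optimistic value-iteration step that UCRL-style algorithms are built around: compute Q-values against the most favourable transition model inside a confidence set around the empirical estimates, then act greedily. This is a minimal illustration, not the paper's modified variant; the function names, the Weissman-style confidence radius, and the assumption of known rewards are all choices made here for the sketch.

import numpy as np

def optimistic_transition(p_hat, values, radius):
    # Shift probability mass toward high-value successor states while
    # staying inside an L1 ball of the given radius around p_hat.
    # Standard inner step of extended value iteration in UCRL-style
    # algorithms (a generic sketch, not the paper's modified variant).
    p = p_hat.copy()
    best = int(np.argmax(values))
    p[best] = min(1.0, p[best] + radius / 2.0)
    for s in np.argsort(values):  # drain surplus mass from low-value states
        if s == best:
            continue
        excess = p.sum() - 1.0
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p

def optimistic_q(counts, rewards, gamma, delta, sweeps=200):
    # Optimistic Q-values for a discounted MDP from empirical counts.
    # counts[s, a, s'] = observed transitions; rewards[s, a] assumed known.
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)
    p_hat = counts / n[:, :, None]
    # Illustrative L1 confidence radius (Weissman-style); the paper's
    # modified UCRL uses its own confidence sets.
    radius = np.sqrt(2.0 * S * np.log(2.0 / delta) / n)
    Q = np.zeros((S, A))
    for _ in range(sweeps):
        V = Q.max(axis=1)
        for s in range(S):
            for a in range(A):
                p = optimistic_transition(p_hat[s, a], V, radius[s, a])
                Q[s, a] = rewards[s, a] + gamma * (p @ V)
    return Q

The design point the sketch makes concrete is optimism in the face of uncertainty: by always planning against the best plausible model, the agent is driven toward state-action pairs whose confidence sets are still wide, and the sample-complexity analysis charges exactly those visits.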