Learning in Neural Networks: Theoretical Foundations
- The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.
- Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
- PAC Bounds for Multi-armed Bandit and Markov Decision Processes. COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory.
- Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS '95: Proceedings of the 36th Annual Symposium on Foundations of Computer Science.
- Online Regret Bounds for Markov Decision Processes with Deterministic Transitions. ALT '08: Proceedings of the 19th International Conference on Algorithmic Learning Theory.
- Efficient Reinforcement Learning in Parameterized Models: Discrete Parameter Case. Recent Advances in Reinforcement Learning.
- Optimal contraction theorem for exploration-exploitation tradeoff in search and optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans.
- Reinforcement Learning in Finite MDPs: PAC Analysis. The Journal of Machine Learning Research.
- Online regret bounds for Markov decision processes with deterministic transitions. Theoretical Computer Science.
- Pure exploration in multi-armed bandits problems. ALT '09: Proceedings of the 20th International Conference on Algorithmic Learning Theory.
- Near-optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research.
- Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science.
- Learning to trade off between exploration and exploitation in multiclass bandit prediction. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence.
- Nearly optimal exploration-exploitation decision thresholds. ICANN '06: Proceedings of the 16th International Conference on Artificial Neural Networks, Part I.
- The K-armed dueling bandits problem. Journal of Computer and System Sciences.
- PAC bounds for discounted MDPs. ALT '12: Proceedings of the 23rd International Conference on Algorithmic Learning Theory.
- Exploration/exploitation trade-off in mobile context-aware recommender systems. AI '12: Proceedings of the 25th Australasian Joint Conference on Advances in Artificial Intelligence.
- Sample complexity of risk-averse bandit-arm selection. IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. Even-Dar et al. (2002) showed that, given n arms, a total of O((n/ε²) log(1/δ)) trials suffices to find an ε-optimal arm with probability at least 1 − δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound to show an explicit dependence on the (unknown) statistics of the arms, and provide a similar bound within a Bayesian setting. We also discuss the case where the statistics of the arms are known but the identities of the arms are not; for this case, we provide a lower bound of Ω((1/ε²)(n + log(1/δ))) on the expected number of trials, together with a sampling policy that achieves a matching upper bound. If, instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish matching upper and lower bounds of the form Θ((n/ε²) log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
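The O((n/ε²) log(1/δ)) upper bound that the lower bounds above match is achieved by the Median Elimination algorithm of Even-Dar et al. (2002). A minimal sketch follows; the sampling constants mirror the standard analysis but are illustrative rather than tuned, and `pull` is a hypothetical reward oracle supplied by the caller.

```python
import math
import random

def median_elimination(pull, n, eps, delta):
    """Return the index of an arm whose mean reward is within `eps` of
    the best arm, with probability at least 1 - delta.  `pull(i)` draws
    one stochastic reward in [0, 1] from arm i."""
    arms = list(range(n))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Hoeffding bound: m samples per arm make each empirical mean
        # accurate to within eps_l / 2 with confidence 1 - delta_l.
        m = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {i: sum(pull(i) for _ in range(m)) / m for i in arms}
        # Keep the empirically better half; halving the surviving set
        # each round is what removes the log(n) factor that a naive
        # "sample every arm to final accuracy" policy would pay.
        arms = sorted(arms, key=lambda i: means[i], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]

# Toy usage: four Bernoulli arms with a clearly separated best arm.
rng = random.Random(0)
true_means = [0.9, 0.1, 0.1, 0.1]
pull = lambda i: 1.0 if rng.random() < true_means[i] else 0.0
best = median_elimination(pull, len(true_means), eps=0.3, delta=0.1)
```

Because the per-round accuracy ε_ℓ shrinks geometrically while the arm set halves, the total number of pulls forms a geometric series summing to O((n/ε²) log(1/δ)), matching the expected-trials lower bound established in the paper.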