Learning in Neural Networks: Theoretical Foundations
- The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.
- Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
- PAC Bounds for Multi-armed Bandit and Markov Decision Processes. COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory.
- Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS '95: Proceedings of the 36th Annual Symposium on Foundations of Computer Science.
- Online Regret Bounds for Markov Decision Processes with Deterministic Transitions. ALT '08: Proceedings of the 19th International Conference on Algorithmic Learning Theory.
- Efficient Reinforcement Learning in Parameterized Models: Discrete Parameter Case. Recent Advances in Reinforcement Learning.
- Optimal contraction theorem for exploration-exploitation tradeoff in search and optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans.
- Reinforcement Learning in Finite MDPs: PAC Analysis. The Journal of Machine Learning Research.
- Online regret bounds for Markov decision processes with deterministic transitions. Theoretical Computer Science.
- Pure exploration in multi-armed bandits problems. ALT '09: Proceedings of the 20th International Conference on Algorithmic Learning Theory.
- Near-optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research.
- Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science.
- Learning to trade off between exploration and exploitation in multiclass bandit prediction. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence.
- Nearly optimal exploration-exploitation decision thresholds. ICANN '06: Proceedings of the 16th International Conference on Artificial Neural Networks, Part I.
- The K-armed dueling bandits problem. Journal of Computer and System Sciences.
- PAC bounds for discounted MDPs. ALT '12: Proceedings of the 23rd International Conference on Algorithmic Learning Theory.
- Exploration/exploitation trade-off in mobile context-aware recommender systems. AI '12: Proceedings of the 25th Australasian Joint Conference on Advances in Artificial Intelligence.
- Sample complexity of risk-averse bandit-arm selection. IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. Even-Dar et al. (2002) showed that, given n arms, a total of O((n/ε²) log(1/δ)) trials suffices to find an ε-optimal arm with probability at least 1 − δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound to show an explicit dependence on the (unknown) statistics of the arms, and provide a similar bound within a Bayesian setting. We also discuss the case where the statistics of the arms are known but the identities of the arms are not; for this case, we provide a lower bound of Ω((1/ε²)(n + log(1/δ))) on the expected number of trials, together with a sampling policy that achieves a matching upper bound. If, instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish matching upper and lower bounds of the form Θ((n/ε²) log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
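The O((n/ε²) log(1/δ)) upper bound that the lower bounds above match is achieved by the Median Elimination algorithm of Even-Dar et al. (2002). A minimal sketch follows; the sampling constants mirror the standard analysis but are illustrative rather than tuned, and `pull` is a hypothetical reward oracle supplied by the caller.

```python
import math
import random

def median_elimination(pull, n, eps, delta):
    """Return the index of an arm whose mean reward is within `eps` of
    the best arm, with probability at least 1 - delta.  `pull(i)` draws
    one stochastic reward in [0, 1] from arm i."""
    arms = list(range(n))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Hoeffding bound: m samples per arm make each empirical mean
        # accurate to within eps_l / 2 with confidence 1 - delta_l.
        m = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {i: sum(pull(i) for _ in range(m)) / m for i in arms}
        # Keep the empirically better half; halving the surviving set
        # each round is what removes the log(n) factor that a naive
        # "sample every arm to final accuracy" policy would pay.
        arms = sorted(arms, key=lambda i: means[i], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]

# Toy usage: four Bernoulli arms with a clearly separated best arm.
rng = random.Random(0)
true_means = [0.9, 0.1, 0.1, 0.1]
pull = lambda i: 1.0 if rng.random() < true_means[i] else 0.0
best = median_elimination(pull, len(true_means), eps=0.3, delta=0.1)
```

Because the per-round accuracy ε_ℓ shrinks geometrically while the arm set halves, the total number of pulls forms a geometric series summing to O((n/ε²) log(1/δ)), matching the expected-trials lower bound established in the paper.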