Pure exploration in multi-armed bandits problems

Authors:
Sébastien Bubeck;Rémi Munos;Gilles Stoltz
Affiliations:
INRIA Lille, SequeL, France;INRIA Lille, SequeL, France;Ecole normale supérieure, CNRS, Paris, France and HEC Paris, CNRS, Jouy-en-Josas, France
Venue:
ALT'09 Proceedings of the 20th international conference on Algorithmic learning theory
Year:
2009

References (5):
The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing
Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning
PAC Bounds for Multi-armed Bandit and Markov Decision Processes. COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. The Journal of Machine Learning Research
Bandit based monte-carlo planning. ECML'06 Proceedings of the 17th European conference on Machine Learning

Cited by (5):
Convergence Rates of Efficient Global Optimization Algorithms. The Journal of Machine Learning Research
Hierarchical Knowledge Gradient for Sequential Sampling. The Journal of Machine Learning Research
Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence
Dynamic pricing with limited supply. Proceedings of the 13th ACM Conference on Electronic Commerce
Scalable and efficient bayes-adaptive reinforcement learning based on monte-carlo tree search. Journal of Artificial Intelligence Research

Abstract

We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
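
To make the two regret notions concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes Bernoulli arms, a round-robin (uniform) allocation, and a recommendation of the arm with the highest empirical mean, and the function name `run_uniform_exploration` and its parameters are hypothetical. The simple regret is measured as the gap between the best mean and the mean of the single arm recommended after the exploration rounds, while the cumulative (pseudo-)regret sums the gaps of all arms pulled along the way.

```python
import random


def run_uniform_exploration(means, n_rounds, seed=0):
    """Explore Bernoulli arms round-robin, then recommend one arm.

    Illustrative assumptions (not from the source): Bernoulli rewards,
    uniform allocation, empirical-best-arm recommendation.
    Returns (simple_regret, cumulative_pseudo_regret).
    """
    rng = random.Random(seed)
    k = len(means)
    best_mean = max(means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    cumulative_regret = 0.0

    for t in range(n_rounds):
        arm = t % k  # uniform allocation: cycle through the arms
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
        # Cumulative pseudo-regret pays the gap of every arm pulled.
        cumulative_regret += best_mean - means[arm]

    # Recommendation step: pick the arm with the highest empirical mean.
    empirical = [reward_sums[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    recommended = max(range(k), key=lambda i: empirical[i])

    # Simple regret only pays the gap of the final recommended arm.
    simple_regret = best_mean - means[recommended]
    return simple_regret, cumulative_regret


if __name__ == "__main__":
    simple, cumulative = run_uniform_exploration([0.2, 0.9], n_rounds=1000)
    print("simple regret:", simple)
    print("cumulative pseudo-regret:", cumulative)
```

Under these assumptions, uniform allocation pulls the suboptimal arm in half of the rounds, so its cumulative regret grows linearly with the number of rounds, yet the final recommendation can still be very accurate. This is the kind of contrast between the two criteria that the abstract describes.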