We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a notion of regret that captures the fact that exploration is constrained only by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered, where exploitation must be performed at the same time. We believe that this performance criterion is suited to situations in which the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration-exploitation trade-offs are qualitatively different, as shown by a general lower bound on the simple regret in terms of the cumulative regret.
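To make the distinction concrete, the following sketch simulates a two-armed Bernoulli bandit under a plain uniform (round-robin) exploration strategy. The simple regret is the gap between the best arm's mean and the mean of the arm recommended after exploration ends; the (expected) cumulative regret accumulates the gap at every pull. The strategy, arm means, and function name are illustrative assumptions, not taken from the paper.

```python
import random

def simple_and_cumulative_regret(means, n_rounds, seed=0):
    """Uniform exploration sketch (assumed strategy, for illustration only).

    Pulls the arms of a Bernoulli bandit in round-robin order for
    n_rounds, then recommends the arm with the highest empirical mean.
    Returns (simple_regret, expected_cumulative_regret).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls per arm
    sums = [0.0] * k          # total observed reward per arm
    best_mean = max(means)
    cumulative = 0.0          # expected (pseudo-)cumulative regret
    for t in range(n_rounds):
        arm = t % k                                   # round-robin pull
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        cumulative += best_mean - means[arm]          # gap paid this round
    # Recommendation: empirically best arm after exploration ends.
    recommended = max(range(k), key=lambda i: sums[i] / counts[i])
    simple = best_mean - means[recommended]           # one-shot gap
    return simple, cumulative

# With a large gap and many rounds, uniform exploration drives the
# simple regret to zero while the cumulative regret keeps growing
# linearly, since half the pulls go to the suboptimal arm.
s, c = simple_and_cumulative_regret([0.9, 0.1], n_rounds=1000)
```

Note how the two criteria pull in opposite directions: a strategy tuned for low cumulative regret must quickly stop pulling bad arms, whereas a pure-exploration strategy can keep sampling them to sharpen its final recommendation.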