Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity because they are easy to implement, efficient, and effective. This paper considers a variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental work, such variance-aware algorithms were found to outperform their competitors. We provide the first analysis of the expected regret of such algorithms. As expected, our results show that the algorithm using variance estimates has a major advantage over alternatives that do not, provided that the variances of the payoffs of the suboptimal arms are low. We also prove that the regret concentrates only at a polynomial rate. This holds for all upper-confidence-bound algorithms and for all bandit problems except the special ones in which, with probability one, the payoff obtained by pulling the optimal arm is larger than the expected payoff of the second-best arm. Hence, although upper-confidence-bound bandit algorithms achieve logarithmic expected regret rates, they might not be suitable for a risk-averse decision maker. We illustrate some of the results by computer simulations.
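The variance-aware index described above can be sketched as follows. This is a minimal illustration in the spirit of a UCB policy with empirical-variance bonuses, not the paper's exact formulation: the function name `ucb_v`, the exploration parameter `zeta`, and the constant in the bias term are illustrative assumptions, and rewards are assumed to lie in `[0, b]`.

```python
import math
import random

def ucb_v(arms, horizon, zeta=1.2, b=1.0):
    """Sketch of a variance-aware upper-confidence-bound policy.

    `arms` is a list of zero-argument callables returning rewards in [0, b].
    `zeta` scales the exploration term; these names and constants are
    illustrative, not taken from the paper.
    Returns the sequence of pulled arm indices and the per-arm pull counts.
    """
    k = len(arms)
    counts = [0] * k        # number of pulls per arm
    sums = [0.0] * k        # sum of rewards per arm
    sq_sums = [0.0] * k     # sum of squared rewards per arm

    def index(i, t):
        # empirical mean and variance of arm i after counts[i] pulls
        n = counts[i]
        mean = sums[i] / n
        var = max(sq_sums[i] / n - mean * mean, 0.0)
        e = zeta * math.log(t)
        # mean + variance-driven bonus + range-driven bias correction
        return mean + math.sqrt(2.0 * var * e / n) + 3.0 * b * e / n

    pulls = []
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1       # pull each arm once to initialise the estimates
        else:
            i = max(range(k), key=lambda j: index(j, t))
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        sq_sums[i] += r * r
        pulls.append(i)
    return pulls, counts
```

On a two-armed Bernoulli instance the policy concentrates its pulls on the better arm, while the suboptimal arm keeps receiving a slowly growing (logarithmic-order) number of pulls, which is consistent with the expected-regret guarantees discussed above.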