On upper-confidence bound policies for switching bandit problems

  • Authors:
  • Aurélien Garivier; Eric Moulines

  • Affiliations:
  • Institut Telecom, Telecom ParisTech, Laboratoire LTCI, CNRS, UMR (both authors)

  • Venue:
  • ALT'11: Proceedings of the 22nd International Conference on Algorithmic Learning Theory
  • Year:
  • 2011

Abstract

Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope, or internet advertisement, can be modelled as non-stationary bandit problems in which the distributions of rewards change abruptly at unknown time instants. In this paper, we analyse two algorithms designed for this setting: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played. The proof relies on an interesting Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and show that both D-UCB and SW-UCB match this lower bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing softmax methods such as EXP3.S.
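
Below is a minimal Python sketch of the sliding-window idea, not the authors' reference implementation: the agent keeps only the last τ plays and computes a UCB index from that window, so rewards observed before an abrupt change are eventually forgotten. D-UCB achieves the same effect with exponential discounting (a reward observed s steps ago is weighted by γ^s) rather than a hard window. The index shape mean + B·sqrt(ξ·log(min(t, τ))/N) follows the paper's notation, but the parameter values (τ = 500, ξ = 0.6) and the toy two-armed environment are illustrative assumptions.

```python
import math
import random
from collections import deque

def sw_ucb(pull, n_arms, horizon, tau=500, xi=0.6, B=1.0):
    """Sliding-window UCB sketch: indices are computed from the last
    `tau` (arm, reward) pairs only. `pull(arm)` must return a reward
    in [0, B]."""
    window = deque(maxlen=tau)          # forgets plays older than tau
    history = []
    for t in range(1, horizon + 1):
        # Windowed play counts and reward sums for each arm.
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, reward in window:
            counts[arm] += 1
            sums[arm] += reward
        if t <= n_arms:
            arm = t - 1                 # initialisation: play each arm once
        else:
            def index(i):
                if counts[i] == 0:      # arm absent from window: explore it
                    return float("inf")
                mean = sums[i] / counts[i]
                pad = B * math.sqrt(xi * math.log(min(t, tau)) / counts[i])
                return mean + pad
            arm = max(range(n_arms), key=index)
        reward = pull(arm)
        window.append((arm, reward))
        history.append((arm, reward))
    return history

# Toy abruptly-changing Bernoulli bandit: the best arm switches at t = 2000.
random.seed(0)
step = [0]
def pull(arm):
    step[0] += 1
    means = [0.9, 0.5] if step[0] < 2000 else [0.3, 0.8]
    return 1.0 if random.random() < means[arm] else 0.0

plays = sw_ucb(pull, n_arms=2, horizon=4000)
print(sum(arm for arm, _ in plays[-500:]), "plays of arm 1 in the last 500 rounds")
```

On this toy problem the window lets the policy recover after the change point at t = 2000, whereas a standard UCB index averaging over the entire past would keep favouring the formerly best arm for a long time.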