A decision-theoretic generalization of on-line learning and an application to boosting
Journal of Computer and System Sciences - Special issue: 26th annual ACM symposium on the theory of computing & STOC'94, May 23–25, 1994, and second annual Europe an conference on computational learning theory (EuroCOLT'95), March 13–15, 1995
Machine Learning - Special issue on context sensitivity and concept drift
The Nonstochastic Multiarmed Bandit Problem
SIAM Journal on Computing
Finite-time Analysis of the Multiarmed Bandit Problem
Machine Learning
Using confidence bounds for exploitation-exploration trade-offs
The Journal of Machine Learning Research
Prediction, Learning, and Games
Prediction, Learning, and Games
Regret Minimization Under Partial Monitoring
Mathematics of Operations Research
Tuning Bandit Algorithms in Stochastic Environments
ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
Piecewise-stationary bandit problems with side observations
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Hi-index | 0.00 |
Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope or internet advertisement, can be modelled as non-stationary bandit problems where the distributions of rewards changes abruptly at unknown time instants. In this paper, we analyze two algorithms designed for solving this issue: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upperbound for the expected regret by upper-bounding the expectation of the number of times suboptimal arms are played. The proof relies on an interesting Hoeffding type inequality for self normalized deviations with a random number of summands. We establish a lower-bound for the regret in presence of abrupt changes in the arms reward distributions. We show that the discounted UCB and the sliding-window UCB both match the lower-bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing soft-max methods like EXP3.S.