We consider a sequential decision problem in which rewards are generated by a piecewise-stationary distribution: the reward distributions themselves are unknown and may change at unknown instants. Our approach uses a limited number of side observations on past rewards and requires no prior knowledge of the frequency of changes. In spite of the adversarial nature of the reward process, we provide an algorithm whose regret, with respect to a baseline with perfect knowledge of the distributions and the change points, is O(k log(T)), where k is the number of changes up to time T. This contrasts with the case where side observations are unavailable, where the regret is at least Ω(√T).
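The abstract does not spell out the algorithm, but the core idea — use occasional side observations to detect a distribution change and restart an otherwise standard bandit strategy — can be illustrated with a minimal sketch built on UCB1. This is an assumption-laden illustration, not the paper's construction: the class name, the sliding-window change detector, and the `window` and `threshold` parameters are all hypothetical choices.

```python
import math
import random

class ChangeDetectingUCB:
    """Illustrative sketch (NOT the paper's exact algorithm): UCB1 plus a
    change detector fed by side observations. `window` and `threshold`
    are assumed tuning parameters."""

    def __init__(self, n_arms, window=20, threshold=0.4):
        self.n_arms = n_arms
        self.window = window        # recent side observations kept per arm
        self.threshold = threshold  # mean shift that triggers a full reset
        self.n_resets = 0
        self.reset()

    def reset(self):
        """Forget all reward statistics (called when a change is detected)."""
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.recent = [[] for _ in range(self.n_arms)]
        self.t = 0

    def select(self):
        """Standard UCB1 arm choice."""
        self.t += 1
        for a in range(self.n_arms):          # play each arm once first
            if self.counts[a] == 0:
                return a
        def ucb(a):
            return self.means[a] + math.sqrt(2.0 * math.log(self.t) / self.counts[a])
        return max(range(self.n_arms), key=ucb)

    def update(self, arm, reward):
        """Incorporate the reward of the arm actually played."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def side_observe(self, arm, reward):
        """Record a side observation on `arm`; if the recent window mean
        drifts far from the running estimate, assume a change and reset."""
        buf = self.recent[arm]
        buf.append(reward)
        if len(buf) > self.window:
            buf.pop(0)
        if (len(buf) == self.window and self.counts[arm] > 0
                and abs(sum(buf) / self.window - self.means[arm]) > self.threshold):
            self.n_resets += 1
            self.reset()
```

In use, the caller plays the arm returned by `select`, feeds its reward to `update`, and occasionally passes a reward sample from an unplayed arm to `side_observe`. Resetting on detection is what keeps the per-change cost logarithmic in spirit; a forgetting-free UCB1 would instead adapt to each change only slowly.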