The weighted majority algorithm
Information and Computation
A decision-theoretic generalization of on-line learning and an application to boosting
Journal of Computer and System Sciences - Special issue: 26th annual ACM symposium on the theory of computing (STOC '94), May 23–25, 1994, and second annual European conference on computational learning theory (EuroCOLT '95), March 13–15, 1995
The Nonstochastic Multiarmed Bandit Problem
SIAM Journal on Computing
Using upper confidence bounds for online learning
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Adaptive Online Prediction by Following the Perturbed Leader
The Journal of Machine Learning Research
Robbing the bandit: less regret in online geometric optimization against an adaptive adversary
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete Algorithms
Efficient algorithms for online decision problems
Journal of Computer and System Sciences - Special issue: Learning theory 2003
Prediction, Learning, and Games
On following the perturbed leader in the bandit setting
ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory
Multi-armed bandit algorithms and empirical evaluation
ECML'05 Proceedings of the 16th European conference on Machine Learning
Tracking the best of many experts
COLT'05 Proceedings of the 18th annual conference on Learning Theory
Following the perturbed leader (fpl) is a powerful technique for solving online decision problems; Kalai and Vempala [1] recently rediscovered this algorithm. A traditional model for online decision problems is the multi-armed bandit, in which a gambler must choose at each round one of k levers to pull with the intention of minimizing the cumulative cost. There are four versions of the nonstochastic optimization setting, of which the most demanding is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions the gambler has chosen in the past. In the bandit setting the gambler learns only the cost of the choice he made, not the costs of all available alternatives. In this work we show that the very straightforward and easy-to-implement algorithm Adaptive Bandit fpl attains a regret of $O(\sqrt{T \ln T})$ against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the previous best regret bounds of $O(\sqrt{T \ln T})$.
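The mechanics the abstract describes, choosing a lever by following the perturbed leader while receiving only bandit feedback, can be illustrated with a minimal sketch. This is one common way to adapt full-information fpl to the bandit setting and is not claimed to be the paper's exact Adaptive Bandit fpl algorithm: the uniform exploration probability `gamma`, the exponential perturbation rate `eta`, and the importance-weighted cost estimate are illustrative assumptions.

```python
import random

def bandit_fpl(cost_fn, T, k, eta=1.0, gamma=0.1, seed=0):
    """Play T rounds of a k-armed bandit with an fpl-style rule.

    cost_fn(t, lever) -> cost in [0, 1]; only the pulled lever's cost
    is revealed, as in the bandit setting.  Returns the gambler's
    total cost over the T rounds.
    """
    rng = random.Random(seed)
    est = [0.0] * k   # importance-weighted estimates of cumulative costs
    total = 0.0
    for t in range(T):
        if rng.random() < gamma:
            # Exploration round: pull a uniformly random lever and scale
            # the observed cost by 1 / (gamma / k) so the cumulative-cost
            # estimates stay unbiased despite the partial feedback.
            lever = rng.randrange(k)
            cost = cost_fn(t, lever)
            est[lever] += cost * k / gamma
        else:
            # Exploitation round: follow the perturbed leader, i.e. pull
            # the lever minimizing estimated cost minus an exponentially
            # distributed perturbation with rate eta.
            perturbed = [est[i] - rng.expovariate(eta) for i in range(k)]
            lever = min(range(k), key=lambda i: perturbed[i])
            cost = cost_fn(t, lever)
        total += cost
    return total
```

Against a fixed cost assignment where one lever is clearly best, the sketch quickly concentrates its pulls on that lever, so its total cost stays close to that of the best lever in hindsight plus the cost of the occasional exploration rounds.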