Following the Perturbed Leader to Gamble at Multi-armed Bandits

  • Authors:
  • Jussi Kujala; Tapio Elomaa

  • Affiliations:
  • Institute of Software Systems, Tampere University of Technology, P. O. Box 553, FI-33101 Tampere, Finland (both authors)

  • Venue:
  • ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
  • Year:
  • 2007

Abstract

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems. Kalai and Vempala [1] rediscovered this algorithm recently. A traditional model for online decision problems is the multi-armed bandit. In it, a gambler has to choose at each round one of the $k$ levers to pull with the intention of minimizing the cumulative cost. There are four versions of the nonstochastic optimization setting, of which the most demanding is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions chosen by the gambler in the past. In the bandit setting the gambler only gets to know the cost of the choice he made, rather than the costs of all available alternatives. In this work we show that the very straightforward and easy-to-implement algorithm Adaptive Bandit FPL can attain a regret of $O(\sqrt{T \ln T})$ against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the previous best regret bounds of $O(\sqrt{T \ln T})$.
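
The abstract gives no pseudocode, so the following is only a minimal illustrative sketch of the general idea it describes: combining FPL with uniform exploration and importance-weighted cost estimates in a $k$-armed bandit. It is not the paper's Adaptive Bandit FPL algorithm; the function names (bandit_fpl, get_cost) and the parameters gamma and eta are hypothetical choices made for the example.

```python
import random

def bandit_fpl(get_cost, k, T, gamma=0.1, eta=1.0):
    """Follow-the-Perturbed-Leader sketch for a k-armed adversarial bandit.

    get_cost(t, arm) reveals only the pulled arm's cost (assumed in [0, 1]),
    as in the bandit setting.  Cost estimates are updated only on exploration
    rounds, where the sampling probability is known, so they stay unbiased.
    """
    est_cost = [0.0] * k   # cumulative importance-weighted cost estimates
    total = 0.0
    for t in range(T):
        if random.random() < gamma:
            # Exploration round: pull a uniformly random arm.
            arm = random.randrange(k)
            c = get_cost(t, arm)
            # Scale by 1 / P(exploring this arm) = k / gamma to keep the
            # cumulative estimate unbiased.
            est_cost[arm] += c * k / gamma
        else:
            # Exploitation round: follow the perturbed leader, i.e. the arm
            # minimizing the randomly perturbed cumulative cost estimate.
            perturbed = [est_cost[i] - eta * random.expovariate(1.0)
                         for i in range(k)]
            arm = min(range(k), key=perturbed.__getitem__)
            c = get_cost(t, arm)
        total += c
    return total
```

The exponential perturbation is the standard FPL ingredient from Kalai and Vempala's analysis; the exploration/estimation scheme here is one common way to adapt FPL to bandit feedback, not necessarily the one analyzed in the paper.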