Self-adjusting binary search trees. Journal of the ACM (JACM).
The weighted majority algorithm. Information and Computation.
The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.
Path kernels and multiplicative updates. The Journal of Machine Learning Research.
Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. STOC '04: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing.
Online convex optimization in the bandit setting: gradient descent without a gradient. SODA '05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms.
Adaptive Online Prediction by Following the Perturbed Leader. The Journal of Machine Learning Research.

Following the Perturbed Leader to Gamble at Multi-armed Bandits
ALT '07: Proceedings of the 18th International Conference on Algorithmic Learning Theory
In an online decision problem, an algorithm must at each time step choose one of the feasible points without knowing the cost associated with it. An adversary assigns costs to the possible decisions, either obliviously or adaptively, and the online algorithm naturally attempts to incur as little cost as possible. The difference between the cost of the online algorithm and that of the best static decision in hindsight is called the regret of the algorithm. Kalai and Vempala [1] showed that some problems with a linear cost function admit efficient solutions by following the perturbed leader; their solution, however, requires the costs of all decisions to be known.

Recently there has also been progress in the bandit setting, where only the cost of the selected decision is observed. A regret bound of $O(T^{2/3})$ over $T$ rounds against an oblivious adversary was first shown by Awerbuch and Kleinberg [2], and later McMahan and Blum [3] showed that a bound of $O(\sqrt{\ln T}\,T^{3/4})$ is obtainable against an adaptive adversary. In this paper we study Kalai and Vempala's model from the viewpoint of bandit algorithms. We show that the algorithm of McMahan and Blum attains a regret of $O(T^{2/3})$ against an oblivious adversary. Moreover, we show a tighter $O(\sqrt{m\ln m}\sqrt{T})$ bound for the expert setting with $m$ experts.
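To make the full-information model that the paper builds on concrete, the following is a minimal sketch of a Follow-the-Perturbed-Leader rule for the expert setting in the spirit of Kalai and Vempala [1]. The function name, the `epsilon` parameter, and the exponential perturbation are illustrative choices, not the paper's exact construction:

```python
import random

def fpl_expert(costs, epsilon=0.1):
    """Follow the Perturbed Leader, full-information expert setting (sketch).

    costs: a list of rounds; each round is a list of per-expert costs in [0, 1].
    Returns the sequence of chosen expert indices.
    """
    m = len(costs[0])
    cumulative = [0.0] * m
    choices = []
    for round_costs in costs:
        # Perturb each expert's cumulative cost with one-sided exponential
        # noise, then follow the (perturbed) leader, i.e. the minimizer.
        perturbed = [cumulative[i] - random.expovariate(epsilon)
                     for i in range(m)]
        leader = min(range(m), key=lambda i: perturbed[i])
        choices.append(leader)
        # Full information: the costs of all experts are revealed each round.
        for i in range(m):
            cumulative[i] += round_costs[i]
    return choices
```

In the bandit setting studied in the paper, only `round_costs[leader]` would be observed, so the unseen costs must be estimated before the cumulative totals can be updated; this is the gap the paper's analysis addresses.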