Following the Perturbed Leader to Gamble at Multi-armed Bandits

  • Authors:
  • Jussi Kujala; Tapio Elomaa

  • Affiliations:
  • Institute of Software Systems, Tampere University of Technology, P. O. Box 553, FI-33101 Tampere, Finland (both authors)

  • Venue:
  • ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
  • Year:
  • 2007

Abstract

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems. Kalai and Vempala [1] rediscovered this algorithm recently. A traditional model for online decision problems is the multi-armed bandit. In it, a gambler has to choose at each round one of the $k$ levers to pull with the intention of minimizing the cumulative cost. There are four versions of the nonstochastic optimization setting, of which the most demanding is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions chosen by the gambler in the past. In the bandit setting the gambler only gets to know the cost of the choice he made, rather than the costs of all available alternatives. In this work we show that the very straightforward and easy-to-implement algorithm Adaptive Bandit FPL can attain a regret of $O(\sqrt{T \ln T})$ against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the previous best regret bounds of $O(\sqrt{T \ln T})$.
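
The abstract gives no pseudocode, so the following is only a minimal illustrative sketch of the general idea it describes: combining FPL with uniform exploration and importance-weighted cost estimates in a $k$-armed bandit. It is not the paper's Adaptive Bandit FPL algorithm; the function names (bandit_fpl, get_cost) and the parameters gamma and eta are hypothetical choices made for the example.

```python
import random

def bandit_fpl(get_cost, k, T, gamma=0.1, eta=1.0):
    """Follow-the-Perturbed-Leader sketch for a k-armed adversarial bandit.

    get_cost(t, arm) reveals only the pulled arm's cost (assumed in [0, 1]),
    as in the bandit setting.  Cost estimates are updated only on exploration
    rounds, where the sampling probability is known, so they stay unbiased.
    """
    est_cost = [0.0] * k   # cumulative importance-weighted cost estimates
    total = 0.0
    for t in range(T):
        if random.random() < gamma:
            # Exploration round: pull a uniformly random arm.
            arm = random.randrange(k)
            c = get_cost(t, arm)
            # Scale by 1 / P(exploring this arm) = k / gamma to keep the
            # cumulative estimate unbiased.
            est_cost[arm] += c * k / gamma
        else:
            # Exploitation round: follow the perturbed leader, i.e. the arm
            # minimizing the randomly perturbed cumulative cost estimate.
            perturbed = [est_cost[i] - eta * random.expovariate(1.0)
                         for i in range(k)]
            arm = min(range(k), key=perturbed.__getitem__)
            c = get_cost(t, arm)
        total += c
    return total
```

The exponential perturbation is the standard FPL ingredient from Kalai and Vempala's analysis; the exploration/estimation scheme here is one common way to adapt FPL to bandit feedback, not necessarily the one analyzed in the paper.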