Better algorithms for benign bandits

Authors:
Elad Hazan;Satyen Kale
Affiliations:
IBM Almaden, San Jose, CA;Microsoft Research, One Microsoft Way, Redmond, WA
Venue:
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Year:
2009

Abstract

The online multi-armed bandit problem and its generalizations are repeated decision-making problems in which the goal is to select one of several possible decisions in every round and incur the cost associated with that decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference between these costs is known as the regret of the algorithm. The term bandit refers to the setting where one observes only the cost of the decision used in a given iteration and no other information. Perhaps the most general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space and the cost functions are linear. Only recently was an efficient algorithm attaining Õ(√T) regret over T rounds discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization, such as reservoir sampling.
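
The abstract names reservoir sampling as one of the ideas the algorithm uses but does not say how it is applied. As a point of reference only, here is a minimal Python sketch of the classic reservoir sampling technique: it keeps a uniform random sample of k items from a stream whose length is not known in advance. The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the paper, and this is not the paper's algorithm.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Illustrative sketch of classic reservoir sampling; the name and
    signature are assumptions, not the paper's notation.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 round indices from a stream of 1000 rounds.
    print(reservoir_sample(range(1000), 5, random.Random(0)))
```

The sketch makes a single pass and stores only k items, which is why the technique suits online settings where rounds arrive one at a time and past observations cannot all be kept.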