Polynomial-time approximation algorithms for the Ising model
SIAM Journal on Computing
Sampling spin configurations of an Ising system
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
The Nonstochastic Multiarmed Bandit Problem
SIAM Journal on Computing
Path kernels and multiplicative updates
The Journal of Machine Learning Research
Convex Optimization
Adaptive routing with end-to-end feedback: distributed learning and geometric approaches
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries
Journal of the ACM (JACM)
Robbing the bandit: less regret in online geometric optimization against an adaptive adversary
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for online decision problems
Journal of Computer and System Sciences - Special issue: Learning theory 2003
Prediction, Learning, and Games
The On-Line Shortest Path Problem Under Partial Monitoring
The Journal of Machine Learning Research
Learning permutations with exponential weights
COLT'07 Proceedings of the 20th annual conference on Learning theory
We study sequential prediction problems in which, at each time instance, the forecaster chooses a vector from a given finite set S ⊆ R^d. At the same time, the opponent chooses a "loss" vector in R^d, and the forecaster suffers a loss equal to the inner product of the two vectors. The goal of the forecaster is to ensure that, in the long run, the accumulated loss is not much larger than that of the best fixed element of S. We consider the "bandit" setting, in which the forecaster only observes the losses of the chosen vectors (i.e., the entire loss vectors are not revealed). We introduce a variant of a strategy by Dani, Hayes, and Kakade that achieves a regret bound which, for a variety of concrete choices of S, is of order √(nd ln|S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. The examples we consider all satisfy S ⊆ {0,1}^d, and we show how the combinatorial structure of these classes can be exploited to improve the regret bounds. We also point out computationally efficient implementations for various interesting choices of S.
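To make the protocol concrete, the following is a minimal sketch of the naive baseline for this setting: treating every vector in S as a separate arm and running a standard Exp3-style exponential-weights update on importance-weighted loss estimates. This is an illustration of the bandit linear-loss protocol only, not the paper's algorithm (which exploits the combinatorial structure of S to get sharper bounds and efficient implementations); the function name and parameters are chosen for the example.

```python
import math
import random

def exp3_over_set(S, loss_vectors, gamma=0.1):
    """Naive Exp3 baseline: one arm per vector in S (hypothetical sketch).

    S            : list of 0/1 vectors (tuples of length d).
    loss_vectors : one loss vector in R^d per round; the forecaster's loss
                   in a round is the inner product with the chosen vector.
    gamma        : exploration/learning-rate parameter.
    Only the scalar loss of the chosen vector is observed (bandit feedback).
    Returns the total accumulated loss of the forecaster.
    """
    K = len(S)
    weights = [1.0] * K
    total_loss = 0.0
    for loss in loss_vectors:
        wsum = sum(weights)
        # Mix exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]
        # Observed feedback: inner product of chosen vector and loss vector.
        ell = sum(v * c for v, c in zip(S[i], loss))
        total_loss += ell
        # Importance-weighted estimate: only the played arm is updated.
        est = ell / probs[i]
        weights[i] *= math.exp(-gamma * est / K)
    return total_loss
```

Because this treats each element of S as unrelated, its regret scales with √|S| rather than with the dimension d, which for combinatorial classes such as paths or permutations (where |S| is exponential in d) is exactly the gap the paper's structured approach closes.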