Journal of the ACM (JACM)
Some label efficient learning results. COLT '97: Proceedings of the Tenth Annual Conference on Computational Learning Theory.
Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences.
The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.
Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS '95: Proceedings of the 36th Annual Symposium on Foundations of Computer Science.
Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research.
Prediction, Learning, and Games.
Adaptive Routing Using Expert Advice. The Computer Journal.
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science.
Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. ALT '06: Proceedings of the 17th International Conference on Algorithmic Learning Theory.
Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory.
Better Algorithms for Benign Bandits. Journal of Machine Learning Research.
Lipschitz bandits without the Lipschitz constant. ALT '11: Proceedings of the 22nd International Conference on Algorithmic Learning Theory.
Dynamic pricing with limited supply. Proceedings of the 13th ACM Conference on Electronic Commerce.
Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research.
Thompson sampling: an asymptotically optimal finite-time analysis. ALT '12: Proceedings of the 23rd International Conference on Algorithmic Learning Theory.
This work deals with four classical prediction settings, namely full information, bandit, label efficient, and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high-probability regret, and tracking-the-best-expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster, and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(-x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, thus filling a long-open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high-probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
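As a concrete illustration, below is a minimal Python sketch of an INF-style forecaster for the K-armed bandit game, written from the description above. It assumes a gains (reward) formulation with importance-weighted estimates, uses the polynomial ψ(x) = (η/(-x))^q + γ/K, and computes the implicit normalization constant by bisection. The function names, the reward_fn interface, and the bracketing bounds for the bisection are illustrative choices, not the authors' implementation.

```python
import numpy as np

def make_poly_psi(eta, gamma, q, K):
    # psi(x) = (eta / (-x))^q + gamma/K, increasing on x < 0.
    return lambda x: (eta / (-x)) ** q + gamma / K

def normalize(psi, G, lo, hi, tol=1e-10):
    # Find the unique c with sum_i psi(G_i - c) = 1 by bisection.
    # psi is increasing, so c -> sum_i psi(G_i - c) is strictly decreasing.
    while hi - lo > tol:
        c = 0.5 * (lo + hi)
        if np.sum(psi(G - c)) > 1.0:
            lo = c  # total probability mass too large: increase c
        else:
            hi = c
    return 0.5 * (lo + hi)

def inf_bandit(reward_fn, K, T, eta, gamma, q=2, seed=0):
    # reward_fn(t, i) -> reward in [0, 1] of arm i at round t
    # (a hypothetical interface, for illustration only).
    rng = np.random.default_rng(seed)
    psi = make_poly_psi(eta, gamma, q, K)
    G = np.zeros(K)  # cumulative importance-weighted reward estimates
    total_reward = 0.0
    for t in range(T):
        # Arguments of psi must stay negative, so c > max_i G_i; the upper
        # bracket is chosen so that sum_i psi(G_i - c) < 1 is guaranteed.
        lo = G.max() + 1e-12
        hi = G.max() + eta * (K / (1.0 - gamma)) ** (1.0 / q) + 1.0
        c = normalize(psi, G, lo, hi)
        p = psi(G - c)
        p /= p.sum()  # guard against bisection round-off
        arm = rng.choice(K, p=p)
        r = reward_fn(t, arm)
        total_reward += r
        G[arm] += r / p[arm]  # unbiased estimate; p[arm] >= gamma/K > 0
    return total_reward
```

The bisection exploits the monotonicity of ψ, which makes the implicit normalization c unique. Note that for the exponential choice ψ(x) = exp(ηx) + γ/K the normalization has a closed form, and the sampling probabilities reduce to the familiar exponentially-weighted-average (Exp3-style) mixture of weights and uniform exploration, consistent with the reduction stated in the abstract.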