Journal of the ACM (JACM)
Some label efficient learning results. COLT '97: Proceedings of the Tenth Annual Conference on Computational Learning Theory.
Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences.
The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing.
Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS '95: Proceedings of the 36th Annual Symposium on Foundations of Computer Science.
Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research.
Prediction, Learning, and Games.
Adaptive Routing Using Expert Advice. The Computer Journal.
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science.
Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. ALT '06: Proceedings of the 17th International Conference on Algorithmic Learning Theory.
Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory.
Better Algorithms for Benign Bandits. Journal of Machine Learning Research.
Lipschitz bandits without the Lipschitz constant. ALT '11: Proceedings of the 22nd International Conference on Algorithmic Learning Theory.
Dynamic pricing with limited supply. Proceedings of the 13th ACM Conference on Electronic Commerce.
Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research.
Thompson sampling: an asymptotically optimal finite-time analysis. ALT '12: Proceedings of the 23rd International Conference on Algorithmic Learning Theory.
This work deals with four classical prediction settings, namely full information, bandit, label efficient, and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high-probability regret, and tracking-the-best-expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster, and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(-x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, thus filling a long-open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high-probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.
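As a concrete illustration, below is a minimal Python sketch of an INF-style forecaster for the K-armed bandit game, written from the description above. It assumes a gains (reward) formulation with importance-weighted estimates, uses the polynomial ψ(x) = (η/(-x))^q + γ/K, and computes the implicit normalization constant by bisection. The function names, the reward_fn interface, and the bracketing bounds for the bisection are illustrative choices, not the authors' implementation.

```python
import numpy as np

def make_poly_psi(eta, gamma, q, K):
    # psi(x) = (eta / (-x))^q + gamma/K, increasing on x < 0.
    return lambda x: (eta / (-x)) ** q + gamma / K

def normalize(psi, G, lo, hi, tol=1e-10):
    # Find the unique c with sum_i psi(G_i - c) = 1 by bisection.
    # psi is increasing, so c -> sum_i psi(G_i - c) is strictly decreasing.
    while hi - lo > tol:
        c = 0.5 * (lo + hi)
        if np.sum(psi(G - c)) > 1.0:
            lo = c  # total probability mass too large: increase c
        else:
            hi = c
    return 0.5 * (lo + hi)

def inf_bandit(reward_fn, K, T, eta, gamma, q=2, seed=0):
    # reward_fn(t, i) -> reward in [0, 1] of arm i at round t
    # (a hypothetical interface, for illustration only).
    rng = np.random.default_rng(seed)
    psi = make_poly_psi(eta, gamma, q, K)
    G = np.zeros(K)  # cumulative importance-weighted reward estimates
    total_reward = 0.0
    for t in range(T):
        # Arguments of psi must stay negative, so c > max_i G_i; the upper
        # bracket is chosen so that sum_i psi(G_i - c) < 1 is guaranteed.
        lo = G.max() + 1e-12
        hi = G.max() + eta * (K / (1.0 - gamma)) ** (1.0 / q) + 1.0
        c = normalize(psi, G, lo, hi)
        p = psi(G - c)
        p /= p.sum()  # guard against bisection round-off
        arm = rng.choice(K, p=p)
        r = reward_fn(t, arm)
        total_reward += r
        G[arm] += r / p[arm]  # unbiased estimate; p[arm] >= gamma/K > 0
    return total_reward
```

The bisection exploits the monotonicity of ψ, which makes the implicit normalization c unique. Note that for the exponential choice ψ(x) = exp(ηx) + γ/K the normalization has a closed form, and the sampling probabilities reduce to the familiar exponentially-weighted-average (Exp3-style) mixture of weights and uniform exploration, consistent with the reduction stated in the abstract.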