COLT '90 Proceedings of the third annual workshop on Computational learning theory
Randomized algorithms
An introduction to Kolmogorov complexity and its applications (2nd ed.)
A decision-theoretic generalization of on-line learning and an application to boosting
Journal of Computer and System Sciences - Special issue: 26th annual ACM symposium on the theory of computing (STOC'94), May 23–25, 1994, and second annual European conference on computational learning theory (EuroCOLT'95), March 13–15, 1995
Individual sequence prediction—upper bounds and application for complexity
COLT '99 Proceedings of the twelfth annual conference on Computational learning theory
The Nonstochastic Multiarmed Bandit Problem
SIAM Journal on Computing
Gambling in a rigged casino: The adversarial multi-armed bandit problem
FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
Universal Artificial Intelligence: Sequential Decisions Based On Algorithmic Probability
Adaptive Online Prediction by Following the Perturbed Leader
The Journal of Machine Learning Research
Anytime algorithms for multi-armed bandit problems
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for online decision problems
Journal of Computer and System Sciences - Special issue: Learning theory 2003
The weighted majority algorithm
SFCS '89 Proceedings of the 30th Annual Symposium on Foundations of Computer Science
Defensive universal learning with experts
ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory
Competitive collaborative learning
COLT'05 Proceedings of the 18th annual conference on Learning Theory
FPL analysis for adaptive bandits
SAGA'05 Proceedings of the Third international conference on Stochastic Algorithms: Foundations and Applications
Complexity-based induction systems: Comparisons and convergence theorems
IEEE Transactions on Information Theory
Online learning in adversarial Lipschitz environments
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II
The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions (''experts'') under partial observation: in each round t, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force on the player a regret growing as t^{1/2} in the time t. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves a regret upper bounded by O(t^{1/2+ε}), i.e. arbitrarily close to the optimal order. To this aim, we build on the ''follow the perturbed leader'' principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high-probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner's decisions impact the future behaviour of the adversary, and a strong strategy can draw benefit from well-chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most t^{1/2+ε} relative to such strong strategies, and even sublinear regret not exceeding O(t^{4/5}) w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy.
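The bandit game and the ''follow the perturbed leader'' principle described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm (which handles countably many experts); it is a toy finite-arm version under the standard simplifications: the leader is perturbed with exponential noise, a small uniform exploration rate keeps playing probabilities bounded away from zero, and observed costs are importance-weighted so cumulative estimates stay roughly unbiased. The function name `fpl_bandit` and the parameters `eta` (perturbation scale) and `gamma` (exploration rate) are illustrative choices, not from the source.

```python
import random

def fpl_bandit(costs, n_arms, eta=0.1, gamma=0.1, seed=0):
    """Toy Follow-the-Perturbed-Leader sketch for the adversarial bandit game.

    `costs` is a sequence of per-round cost vectors with entries in [0, 1];
    in each round only the cost of the arm actually played is revealed.
    Returns the learner's total incurred cost.
    """
    rng = random.Random(seed)
    est = [0.0] * n_arms      # cumulative importance-weighted cost estimates
    total_cost = 0.0
    for cost_vec in costs:
        if rng.random() < gamma:
            # explore: pick an arm uniformly at random
            arm = rng.randrange(n_arms)
        else:
            # exploit: follow the leader after subtracting exponential
            # perturbations (mean 1/eta) from each arm's estimated cost
            perturbed = [est[i] - rng.expovariate(eta) for i in range(n_arms)]
            arm = min(range(n_arms), key=lambda i: perturbed[i])
        c = cost_vec[arm]     # only this single entry is observed
        total_cost += c
        # importance weighting: each arm is played with probability at least
        # gamma / n_arms, so dividing by that lower bound keeps the estimate
        # bounded (a common simplification in FPL bandit analyses)
        est[arm] += c * n_arms / gamma
    return total_cost
```

On an easy instance where one arm is always free and the other always costs 1, the learner quickly concentrates on the free arm, paying mainly for its forced exploration rounds; the sublinear-regret guarantees in the abstract formalize this behaviour against arbitrary (even adaptive) cost sequences.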