The Nonstochastic Multiarmed Bandit Problem

Authors:
Peter Auer;Nicolò Cesa-Bianchi;Yoav Freund;Robert E. Schapire
Venue:
SIAM Journal on Computing
Year:
2003

Abstract

In the multiarmed bandit problem, a gambler must decide which arm of $K$ nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/2})$. We show by a matching lower bound that this is the best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of $N$ strategies, then our algorithm approaches the per-round payoff of the strategy at the rate $O((\log N)^{1/2} T^{-1/2})$. Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate $O(T^{-1/2})$.
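The abstract does not spell out the algorithm, so the following is only a minimal sketch of an exponential-weighting scheme of the kind usually called Exp3, written under stated assumptions: payoffs lie in [0, 1], only the payoff of the played arm is observed, and the mixing parameter `gamma` is supplied by the caller. The names `exp3_play`, `reward_fn`, and `gamma` are hypothetical and chosen for illustration; the paper's own tuning and analysis are not reproduced here.

```python
import math
import random


def exp3_play(reward_fn, num_arms, horizon, gamma, seed=0):
    """Sketch of an Exp3-style bandit player (illustrative, not the paper's exact statement).

    reward_fn(t, arm) must return a payoff in [0, 1]; it may be chosen
    adversarially. gamma in (0, 1] is the exploration rate (an assumption:
    it is passed in rather than tuned from the horizon).
    Returns the total payoff collected over `horizon` plays.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    total_reward = 0.0

    for t in range(horizon):
        weight_sum = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / weight_sum + gamma / num_arms
                 for w in weights]

        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)  # only the played arm's payoff is seen
        total_reward += reward

        # Importance-weighted estimate: unbiased for the played arm's payoff.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)

        # Rescale so the largest weight is 1; this leaves probs unchanged
        # and avoids floating-point overflow on long runs.
        largest = max(weights)
        weights = [w / largest for w in weights]

    return total_reward


if __name__ == "__main__":
    # Toy payoff sequence (an assumption for the demo): arm 2 always pays 1.
    payoff = lambda t, arm: 1.0 if arm == 2 else 0.0
    print(exp3_play(payoff, num_arms=3, horizon=2000, gamma=0.1))
```

In this toy run the player should concentrate on the paying arm after a short exploration phase, while the `gamma / num_arms` floor keeps every arm's probability positive so the importance-weighted estimates stay bounded. The value `gamma=0.1` is arbitrary; a rate of order $T^{-1/2}$ as stated in the abstract would require choosing it as a function of the horizon.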