Gambling in a rigged casino: The adversarial multi-armed bandit problem

Authors:
P. Auer; N. Cesa-Bianchi; Y. Freund; R. E. Schapire
Venue:
FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
Year:
1995

Citing 0
Cited 81

On-line evaluation and prediction using linear functions

COLT '97 Proceedings of the tenth annual conference on Computational learning theory
Competitive solutions for online financial problems

ACM Computing Surveys (CSUR)
Reinforcement learning and mistake bounded algorithms

COLT '99 Proceedings of the twelfth annual conference on Computational learning theory
Individual sequence prediction—upper bounds and application for complexity

COLT '99 Proceedings of the twelfth annual conference on Computational learning theory
Probabilistic pricebots

Proceedings of the fifth international conference on Autonomous agents
Static optimality and dynamic search-optimality in lists and trees

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Online learning in online auctions

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Discrete Prediction Games with Arbitrary Feedback and Loss

COLT '01/EuroCOLT '01 Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory
PAC Bounds for Multi-armed Bandit and Markov Decision Processes

COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory
Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov Environments

COLT '01/EuroCOLT '01 Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory
Adapting to a reliable network path

Proceedings of the twenty-second annual symposium on Principles of distributed computing
The empirical Bayes envelope and regret minimization in competitive Markov decision processes

Mathematics of Operations Research
Using confidence bounds for exploitation-exploration trade-offs

The Journal of Machine Learning Research
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

The Journal of Machine Learning Research
Adaptive routing with end-to-end feedback: distributed learning and geometric approaches

STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Competitive on-line paging strategies for mobile users under delay constraints

Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing
The Role of Reactivity in Multiagent Learning

AAMAS '04 Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2
Online learning in online auctions

Theoretical Computer Science - Special issue: Online algorithms in memoriam, Steve Seiden
Online convex optimization in the bandit setting: gradient descent without a gradient

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Near-optimal online auctions

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Reinforcement learning for active model selection

UBDM '05 Proceedings of the 1st international workshop on Utility-based data mining
Hedged learning: regret-minimization with learning experts

ICML '05 Proceedings of the 22nd international conference on Machine learning
Robbing the bandit: less regret in online geometric optimization against an adaptive adversary

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
An adaptive algorithm for selecting profitable keywords for search-based advertising services

EC '06 Proceedings of the 7th ACM conference on Electronic commerce
Learning algorithms for online principal-agent problems (and selling goods online)

ICML '06 Proceedings of the 23rd international conference on Machine learning
Stochastic Approximations and Differential Inclusions, Part II: Applications

Mathematics of Operations Research
AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents

Machine Learning
No regrets about no-regret

Artificial Intelligence
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

The Journal of Machine Learning Research
An experts approach to strategy selection in multiagent meeting scheduling

Autonomous Agents and Multi-Agent Systems
Reactivity and Safe Learning in Multi-Agent Systems

Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems
Effective change detection using sampling

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Generalized multiagent learning with performance bound

Autonomous Agents and Multi-Agent Systems
Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments

Theoretical Computer Science
A Reinforcement Learning Approach to Interval Constraint Propagation

Constraints
Efficient bandit algorithms for online multiclass prediction

Proceedings of the 25th international conference on Machine learning
Exploration scavenging

Proceedings of the 25th international conference on Machine learning
QoS-LI: QoS loss inference in disadvantaged networks -- part II

Proceedings of the 11th communications and networking simulation symposium
Competitive collaborative learning

Journal of Computer and System Sciences
Approximation algorithms for restless bandit problems

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Effective short-term opponent exploitation in simplified poker

Machine Learning
To create neuro-controlled game opponent from UCT-created data

Proceedings of the first ACM/SIGEVO Summit on Genetic and Evolutionary Computation
The offset tree for learning with partial labels

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
A distributed reinforcement learning approach to mission survivability in tactical MANETs

Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research: Cyber Security and Information Intelligence Challenges and Strategies
Experiments with Adaptive Transfer Rate in Reinforcement Learning

Knowledge Acquisition: Approaches, Algorithms and Applications
Performance bounded reinforcement learning in strategic interactions

AAAI'04 Proceedings of the 19th national conference on Artificial intelligence
Effective short-term opponent exploitation in simplified poker

AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Dynamic non-Bayesian decision making

Journal of Artificial Intelligence Research
Learning restart strategies

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Investigations of continual computation

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Monte-Carlo exploration for deterministic planning

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Automatic weight learning for multiple data sources when learning from demonstration

ICRA'09 Proceedings of the 2009 IEEE international conference on Robotics and Automation
Apple tasting

Information and Computation
Playing monotone games to understand learning behaviors

Theoretical Computer Science
To create intelligent adaptive neuro-controller of game opponent from UCT-created data

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 2
An interconnection game between mobile network operators: Hidden information forecasting using expert advice fusion

Computer Networks: The International Journal of Computer and Telecommunications Networking
Online learning in adversarial Lipschitz environments

ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II
Algorithm selection as a bandit problem with unbounded losses

LION'10 Proceedings of the 4th international conference on Learning and intelligent optimization
Catch me if you can: an abnormality detection approach for collaborative spectrum sensing in cognitive radio networks

IEEE Transactions on Wireless Communications
Regret Bounds and Minimax Policies under Partial Monitoring

The Journal of Machine Learning Research
A non-cooperative game-theoretic approach to channel assignment in multi-channel multi-radio wireless networks

Wireless Networks
A dynamic programming strategy to balance exploration and exploitation in the bandit problem

Annals of Mathematics and Artificial Intelligence
Upper confidence trees with short term partial information

EvoApplications'11 Proceedings of the 2011 international conference on Applications of evolutionary computation - Volume Part I
Learning the demand curve in posted-price digital goods auctions

The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Hannan consistency in on-line learning in case of unbounded losses under partial monitoring

ALT'06 Proceedings of the 17th international conference on Algorithmic Learning Theory
Defensive universal learning with experts

ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory
Continuous experts and the binning algorithm

COLT'06 Proceedings of the 19th annual conference on Learning Theory
Learning to select negotiation strategies in multi-agent meeting scheduling

EPIA'05 Proceedings of the 12th Portuguese conference on Progress in Artificial Intelligence
Multi-armed bandit algorithms and empirical evaluation

ECML'05 Proceedings of the 16th European conference on Machine Learning
Competitive collaborative learning

COLT'05 Proceedings of the 18th annual conference on Learning Theory
FPL analysis for adaptive bandits

SAGA'05 Proceedings of the Third international conference on Stochastic Algorithms: foundations and applications
Unifying convergence and no-regret in multiagent learning

LAMAS'05 Proceedings of the First international conference on Learning and Adaption in Multi-Agent Systems
Competitive strategy for on-line leasing of depreciable equipment

Mathematical and Computer Modelling: An International Journal
Just add Pepper: extending learning algorithms for repeated matrix games to repeated Markov games

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Strong mitigation: nesting search for good policies within search for good reward

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Adaptive negotiating agents in dynamic games: outperforming human behavior in diverse societies

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3
Online implicit agent modelling

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Learning in real-time in repeated games using experts

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Optimum Object Selection Made Easy

Wireless Personal Communications: An International Journal
Online learning for auction mechanism in bandit setting

Decision Support Systems
Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers

Machine Learning

Abstract

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/3}), and we give an improved rate of convergence when the best arm has fairly low payoff. We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.
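
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one standard way to play a bandit without statistical assumptions: keep a weight per arm, sample an arm from the weights mixed with uniform exploration, and update the chosen arm with an importance-weighted payoff estimate. The function name `exp3`, the parameter `gamma`, the callback `reward_fn`, and the assumption that payoffs lie in [0, 1] are all choices made for this example and are not taken from the record above.

```python
import math
import random


def exp3(num_arms, num_rounds, reward_fn, gamma=0.1, seed=0):
    """Exponential-weights bandit sketch; payoffs are assumed to lie in [0, 1].

    reward_fn(t, arm) is a hypothetical callback that returns the payoff of
    the chosen arm at round t. Only that single payoff is ever observed.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    probs = [1.0 / num_arms] * num_arms
    total_reward = 0.0
    for t in range(num_rounds):
        weight_sum = sum(weights)
        # Mix the weight-based distribution with uniform exploration so that
        # every arm keeps probability at least gamma / num_arms.
        probs = [(1 - gamma) * w / weight_sum + gamma / num_arms
                 for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Dividing by the selection probability makes the estimate unbiased
        # for the chosen arm; unchosen arms implicitly get an estimate of 0.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; this leaves the sampling
        # distribution unchanged and avoids floating-point overflow.
        top = max(weights)
        weights = [w / top for w in weights]
    return total_reward, probs


if __name__ == "__main__":
    # Toy payoff sequence: arm 2 always pays 1, the other arms pay 0.
    payoff, final_probs = exp3(3, 2000, lambda t, arm: 1.0 if arm == 2 else 0.0)
    print(payoff, final_probs)
```

The payoff callback can depend on the round index `t` in any way, which is how a non-stochastic (adversarially chosen) payoff sequence would be plugged in. The fixed value of `gamma` here is arbitrary and would normally be tuned to the number of rounds.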