Using confidence bounds for exploitation-exploration trade-offs

Authors:
Peter Auer
Affiliations:
Graz University of Technology, Institute for Theoretical Computer Science, Inffeldgasse 16b, A-8010 Graz, Austria
Venue:
The Journal of Machine Learning Research
Year:
2003


Abstract

We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations that exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We apply our technique to two models with such an exploitation-exploration trade-off. For the adversarial bandit problem with shifting, our new algorithm suffers only O((ST)^{1/2}) regret with high probability over T trials with S shifts. Such a regret bound was previously known only in expectation. The second model we consider is associative reinforcement learning with linear value functions. For this model, our technique improves the regret from O(T^{3/4}) to O(T^{1/2}).
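
To make the general idea concrete, the following is a minimal sketch of how an upper confidence bound can drive exploitation-versus-exploration decisions. It is not one of the two algorithms from this paper: it follows the simpler, well-known UCB1-style rule for a stochastic bandit with Bernoulli rewards, and the names `ucb_index`, `run_ucb`, `arm_probs`, and the constant `c` are assumptions made for illustration only.

```python
import math
import random


def ucb_index(mean, pulls, t, c=2.0):
    """Optimistic estimate: empirical mean plus a confidence width.

    The width shrinks as an arm is pulled more often, so rarely tried
    arms keep a large bonus and are eventually explored again.
    """
    return mean + math.sqrt(c * math.log(t) / pulls)


def run_ucb(arm_probs, horizon, seed=0):
    """Simulate a UCB1-style learner on hypothetical Bernoulli arms."""
    rng = random.Random(seed)
    k = len(arm_probs)
    pulls = [0] * k      # number of times each arm was chosen
    means = [0.0] * k    # running empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: try every arm once
        else:
            # Choose the arm with the largest upper confidence bound.
            arm = max(range(k),
                      key=lambda i: ucb_index(means[i], pulls[i], t))
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / pulls[arm]
    return pulls, means


if __name__ == "__main__":
    pulls, means = run_ucb([0.1, 0.9], horizon=2000)
    print("pulls per arm:", pulls)
    print("empirical means:", means)
```

The single index combines both concerns: a high empirical mean favors exploitation, while a wide confidence interval favors exploration. The shifting adversarial setting and the linear value function setting treated in the paper need different estimators and confidence widths than this sketch uses.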