Multi-armed bandits with episode context

Authors:
Christopher D. Rosin
Affiliations:
Parity Computing, Inc., San Diego, USA 92121
Venue:
Annals of Mathematics and Artificial Intelligence
Year:
2011

Abstract

A multi-armed bandit episode consists of n trials, each allowing selection of one of K arms, resulting in payoff from a distribution over [0,1] associated with that arm. We assume contextual side information is available at the start of the episode. This context enables an arm predictor to identify possibly favorable arms, but predictions may be imperfect, so they need to be combined with further exploration during the episode. Our setting is an alternative to classical multi-armed bandits, which provide no contextual side information, and is also an alternative to contextual bandits, which provide new context at each individual trial. Multi-armed bandits with episode context can arise naturally, for example in computer Go, where context is used to bias move decisions made by a multi-armed bandit algorithm. The UCB1 algorithm for multi-armed bandits achieves worst-case regret bounded by $O\left(\sqrt{Kn\log(n)}\right)$. We seek to improve this using episode context, particularly in the case where K is large. Using a predictor that places weight $M_i > 0$ on arm i, with weights summing to 1, we present the PUCB algorithm, which achieves regret $O\left(\frac{1}{M_{\ast}}\sqrt{n\log(n)}\right)$, where $M_{\ast}$ is the weight on the optimal arm. We illustrate the behavior of PUCB with small simulation experiments, present extensions that provide additional capabilities for PUCB, and describe methods for obtaining suitable predictors for use with PUCB.
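
To make the baseline concrete, the following is a minimal sketch of one episode of UCB1, the algorithm the abstract compares against. It is not PUCB: the predictor weights $M_i$ are not used, and the paper's exact selection rule is not reproduced here. The sketch assumes Bernoulli payoffs, the standard UCB1 exploration bonus $\sqrt{2\ln t / n_i}$, and a hypothetical function name `ucb1`; the reported regret is the expected (pseudo-)regret computed from the true arm means, which are known only to the simulator.

```python
import math
import random


def ucb1(arm_means, n_trials, seed=0):
    """Run one episode of UCB1 on simulated Bernoulli arms.

    arm_means: true success probabilities in [0, 1] (assumed, for simulation only).
    Returns (pull counts per arm, pseudo-regret accumulated over the episode).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of times each arm has been pulled
    sums = [0.0] * k      # total payoff observed for each arm
    best = max(arm_means)
    regret = 0.0

    for t in range(1, n_trials + 1):
        if t <= k:
            # Initialization: pull each arm once so every mean estimate exists.
            arm = t - 1
        else:
            # Choose the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        payoff = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += payoff
        regret += best - arm_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.1, 0.5, 0.9], n_trials=5000)
    print("pulls per arm:", counts)
    print("pseudo-regret:", round(regret, 2))
```

Because UCB1 must try every arm before it can rank them, its cost grows with K. This is the regime the abstract targets: comparing the two stated bounds, the PUCB bound is the smaller one roughly when $1/M_{\ast} < \sqrt{K}$, that is, when the predictor places weight of more than about $1/\sqrt{K}$ on the optimal arm.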