In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web-based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better-directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson's method throughout.
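To make the contrast concrete, the sketch below implements Thompson sampling for a Bernoulli bandit with Beta posteriors, together with an optimistic variant in the spirit of OBS: each arm's index is floored at its posterior mean, so a posterior sample can only raise an arm's score above its exploitative value, never lower it. This is an illustrative reading of the abstract, not the authors' reference implementation; the function name `thompson_choice` and the `optimistic` flag are assumptions for this example.

```python
import random


def thompson_choice(successes, failures, optimistic=False):
    """Pick an arm by sampling from each arm's Beta posterior.

    successes[i], failures[i] count observed Bernoulli outcomes for
    arm i under a Beta(1, 1) prior.  With optimistic=False this is
    plain Thompson sampling.  With optimistic=True, each arm's index
    is max(posterior sample, posterior mean) -- a sketch of the OBS
    idea: exploration can only increase an arm's score above its
    current exploitative estimate.
    """
    best_arm, best_index = None, float("-inf")
    for arm, (s, f) in enumerate(zip(successes, failures)):
        sample = random.betavariate(s + 1, f + 1)  # draw from posterior
        mean = (s + 1) / (s + f + 2)               # posterior mean
        index = max(sample, mean) if optimistic else sample
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


# Toy usage: two arms, one clearly better on the evidence so far.
if __name__ == "__main__":
    chosen = thompson_choice([50, 2], [10, 40], optimistic=True)
    print("chose arm", chosen)
```

Flooring the index at the posterior mean means an arm with a wide posterior keeps a real chance of beating the current best, while an arm that is well estimated and plainly inferior is almost never played, which is the "better-directed exploration" the abstract describes.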