Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity because they are easy to implement, efficient, and effective. In this paper we consider a variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental work, such algorithms were found to outperform competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and to give theoretical guidelines for tuning the parameters of these algorithms. To this end, we analyze both the expected regret and, for the first time, the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret of upper-confidence-bound algorithms with standard bias sequences concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of suffering much worse than logarithmic regret is also taken into account.
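To illustrate the idea described above, here is a minimal sketch of a variance-aware upper-confidence-bound policy in the spirit of the algorithms the paper analyzes. The index adds to the empirical mean a Bernstein-type bonus that shrinks when an arm's empirical variance is small; the constants `zeta` and `c` and the Bernoulli reward model are illustrative assumptions, not the paper's exact tuning.

```python
import math
import random

def ucbv_index(mean, var, pulls, t, zeta=1.2, c=1.0):
    """Variance-aware UCB index: empirical mean plus an exploration
    bonus driven by the empirical variance (Bernstein-style term)
    plus a correction term decaying with the number of pulls.
    zeta and c are illustrative tuning constants."""
    e = zeta * math.log(t)
    return mean + math.sqrt(2.0 * var * e / pulls) + 3.0 * c * e / pulls

def run_bandit(arm_means, horizon, seed=0):
    """Play `horizon` rounds of a Bernoulli bandit with the given
    arm means, choosing at each step the arm with the highest
    variance-aware index. Returns per-arm pull counts and total reward."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                     # pull each arm once to initialise
        else:
            def index(j):
                m = sums[j] / pulls[j]
                v = max(sq_sums[j] / pulls[j] - m * m, 0.0)  # empirical variance
                return ucbv_index(m, v, pulls[j], t)
            i = max(range(k), key=index)
        r = 1.0 if rng.random() < arm_means[i] else 0.0      # Bernoulli reward
        pulls[i] += 1
        sums[i] += r
        sq_sums[i] += r * r
        total_reward += r
    return pulls, total_reward
```

Run on a two-armed problem with a large gap and the pull counts concentrate on the better arm, which is the behavior whose expected-regret and concentration properties the paper studies.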