The K-armed bandit problem is a formalization of the exploration-versus-exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution, and the goal is to maximize the sum of the rewards (or minimize the sum of the losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by having recourse to the notion of probability of correct selection (PCS), a well-known notion in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to perform, at each exploration step, the arm sampling which maximizes the probability of selecting the optimal arm (i.e. the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
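Since the abstract names the mechanism but not its details, here is a minimal sketch of the idea, under stated assumptions: Gaussian-reward arms, a product-of-pairwise-comparisons (Welch-type) approximation to the PCS, and a one-step look-ahead that pretends the candidate arm returns an extra sample at its current mean. The names `pcs`, `epsilon_pcs_greedy`, the two warm-up pulls per arm, and the look-ahead rule are all illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random
import statistics

def pcs(samples):
    """Approximate the probability that the arm with the highest sample
    mean is truly the best arm, via a product of pairwise Gaussian
    (Welch-type) comparisons. Assumed form, not the paper's formula."""
    means = [statistics.fmean(s) for s in samples]
    se2 = [statistics.variance(s) / len(s) for s in samples]  # squared standard errors
    b = max(range(len(samples)), key=lambda i: means[i])      # empirically best arm
    p = 1.0
    for j in range(len(samples)):
        if j != b:
            z = (means[b] - means[j]) / math.sqrt(se2[b] + se2[j] + 1e-12)
            p *= 0.5 * (1.0 + math.erf(z / math.sqrt(2)))     # Phi(z)
    return p

def epsilon_pcs_greedy(arms, horizon, eps=0.1):
    """arms: list of zero-argument callables returning a noisy reward."""
    K = len(arms)
    samples = [[arm() for _ in range(2)] for arm in arms]     # two warm-up pulls per arm
    for _ in range(horizon - 2 * K):
        if random.random() > eps:
            # Exploitation: pull the arm with the highest empirical mean.
            k = max(range(K), key=lambda i: statistics.fmean(samples[i]))
        else:
            # Exploration: instead of a uniform random pull, choose the arm
            # whose extra observation would most increase the anticipated PCS,
            # approximated by appending a pretend sample at the arm's mean
            # (which shrinks that arm's standard error).
            def anticipated(i):
                trial = [list(s) for s in samples]
                trial[i].append(statistics.fmean(samples[i]))
                return pcs(trial)
            k = max(range(K), key=anticipated)
        samples[k].append(arms[k]())
    return sum(map(sum, samples))  # total collected reward

# Example: three Gaussian arms with means 0.2, 0.5 and 0.9.
random.seed(0)
arms = [lambda m=m: random.gauss(m, 1.0) for m in (0.2, 0.5, 0.9)]
print(epsilon_pcs_greedy(arms, horizon=500))
```

The only change from plain ε-greedy is inside the exploration branch: the uniform random pull is replaced by the PCS-maximizing pull, so exploration effort is directed at the arms whose uncertainty currently limits the probability of correctly identifying the best arm.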