Improving the Exploration Strategy in Bandit Algorithms

  • Authors:
  • Olivier Caelen; Gianluca Bontempi

  • Affiliations:
  • Machine Learning Group, Département d'Informatique, Faculté des Sciences, Université Libre de Bruxelles, Bruxelles, Belgium (both authors)

  • Venue:
  • Learning and Intelligent Optimization
  • Year:
  • 2008

Abstract

The K-armed bandit problem is a formalization of the exploration versus exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution, and the goal is to maximize the sum of the rewards (or minimize the sum of losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by having recourse to the notion of probability of correct selection (PCS), a well-known notion in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to perform, at each exploration step, the arm sampling that maximizes the probability of selecting the optimal arm (i.e. the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy schema. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
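
To make the abstract's description of ε-PCSgreedy concrete, the sketch below shows one plausible reading of the idea: an ε-greedy agent whose exploration step, instead of pulling an arm uniformly at random, pulls the arm whose additional sample is expected to most increase an approximate PCS. This is not the authors' implementation; the Gaussian pairwise approximation of the PCS, the class name EpsilonPCSGreedy, and all parameter values are illustrative assumptions.

```python
# Hedged sketch of a PCS-driven exploration step inside an epsilon-greedy bandit.
# Assumption: sample means are treated as independent Gaussians; the PCS is
# approximated as the product of pairwise win probabilities of the current best arm.
import numpy as np
from scipy.stats import norm


class EpsilonPCSGreedy:
    def __init__(self, n_arms, epsilon=0.1, rng=None):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = rng or np.random.default_rng()
        self.counts = np.zeros(n_arms, dtype=int)
        self.sums = np.zeros(n_arms)
        self.sq_sums = np.zeros(n_arms)

    def _stats(self):
        counts = np.maximum(self.counts, 1)
        means = self.sums / counts
        variances = np.maximum(self.sq_sums / counts - means**2, 1e-6)
        return means, variances, counts

    def _pcs(self, means, variances, counts):
        # Approximate probability that the empirically best arm truly beats each rival.
        best = int(np.argmax(means))
        pcs = 1.0
        for i in range(self.n_arms):
            if i == best:
                continue
            se = np.sqrt(variances[best] / counts[best] + variances[i] / counts[i])
            pcs *= norm.cdf((means[best] - means[i]) / se)
        return pcs

    def _exploration_arm(self):
        # Choose the arm whose hypothetical extra pull (shrinking the variance of its
        # sample mean) yields the largest approximate PCS at the next step.
        means, variances, counts = self._stats()
        best_arm, best_pcs = 0, -np.inf
        for i in range(self.n_arms):
            c = counts.copy()
            c[i] += 1
            pcs = self._pcs(means, variances, c)
            if pcs > best_pcs:
                best_arm, best_pcs = i, pcs
        return best_arm

    def select_arm(self):
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))   # pull every arm once to initialize
        if self.rng.random() < self.epsilon:
            return self._exploration_arm()       # PCS-driven exploration
        means, _, _ = self._stats()
        return int(np.argmax(means))             # greedy exploitation

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.sq_sums[arm] += reward**2


# Toy run on three Gaussian arms (illustrative only, not the paper's experiments).
if __name__ == "__main__":
    true_means = [0.2, 0.5, 0.45]
    bandit = EpsilonPCSGreedy(n_arms=3, epsilon=0.1)
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(2000):
        a = bandit.select_arm()
        r = rng.normal(true_means[a], 1.0)
        bandit.update(a, r)
        total += r
    print("pulls per arm:", bandit.counts, "total reward:", round(total, 1))
```

The only difference from plain ε-greedy is the body of the exploration branch; exploitation and the reward bookkeeping are unchanged, which mirrors the abstract's claim that the contribution lies in improving the exploration step rather than the overall schema.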