The K-armed bandit problem is a formalization of the exploration-versus-exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution, and the goal is to maximize the sum of the rewards (or minimize the sum of the losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by having recourse to the notion of probability of correct selection (PCS), a well-known notion in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to perform, at each exploration step, the arm sampling which maximizes the probability of selecting the optimal arm (i.e. the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
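Since the abstract names the mechanism but not its details, here is a minimal sketch of the idea, under stated assumptions: Gaussian-reward arms, a product-of-pairwise-comparisons (Welch-type) approximation to the PCS, and a one-step look-ahead that pretends the candidate arm returns an extra sample at its current mean. The names `pcs`, `epsilon_pcs_greedy`, the two warm-up pulls per arm, and the look-ahead rule are all illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random
import statistics

def pcs(samples):
    """Approximate the probability that the arm with the highest sample
    mean is truly the best arm, via a product of pairwise Gaussian
    (Welch-type) comparisons. Assumed form, not the paper's formula."""
    means = [statistics.fmean(s) for s in samples]
    se2 = [statistics.variance(s) / len(s) for s in samples]  # squared standard errors
    b = max(range(len(samples)), key=lambda i: means[i])      # empirically best arm
    p = 1.0
    for j in range(len(samples)):
        if j != b:
            z = (means[b] - means[j]) / math.sqrt(se2[b] + se2[j] + 1e-12)
            p *= 0.5 * (1.0 + math.erf(z / math.sqrt(2)))     # Phi(z)
    return p

def epsilon_pcs_greedy(arms, horizon, eps=0.1):
    """arms: list of zero-argument callables returning a noisy reward."""
    K = len(arms)
    samples = [[arm() for _ in range(2)] for arm in arms]     # two warm-up pulls per arm
    for _ in range(horizon - 2 * K):
        if random.random() > eps:
            # Exploitation: pull the arm with the highest empirical mean.
            k = max(range(K), key=lambda i: statistics.fmean(samples[i]))
        else:
            # Exploration: instead of a uniform random pull, choose the arm
            # whose extra observation would most increase the anticipated PCS,
            # approximated by appending a pretend sample at the arm's mean
            # (which shrinks that arm's standard error).
            def anticipated(i):
                trial = [list(s) for s in samples]
                trial[i].append(statistics.fmean(samples[i]))
                return pcs(trial)
            k = max(range(K), key=anticipated)
        samples[k].append(arms[k]())
    return sum(map(sum, samples))  # total collected reward

# Example: three Gaussian arms with means 0.2, 0.5 and 0.9.
random.seed(0)
arms = [lambda m=m: random.gauss(m, 1.0) for m in (0.2, 0.5, 0.9)]
print(epsilon_pcs_greedy(arms, horizon=500))
```

The only change from plain ε-greedy is inside the exploration branch: the uniform random pull is replaced by the PCS-maximizing pull, so exploration effort is directed at the arms whose uncertainty currently limits the probability of correctly identifying the best arm.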