A dynamic programming strategy to balance exploration and exploitation in the bandit problem

Authors:
Olivier Caelen; Gianluca Bontempi
Affiliations:
Computer Science Department, Université Libre de Bruxelles, Bruxelles, Belgium 1050; Computer Science Department, Université Libre de Bruxelles, Bruxelles, Belgium 1050
Venue:
Annals of Mathematics and Artificial Intelligence
Year:
2010

Abstract

The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player faces a gambling machine with K arms, each associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of "expected reward of greedy actions", which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm that relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.
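
To make the terms in the abstract concrete, the following is a minimal sketch of a standard fixed-epsilon greedy player (one common semi-uniform strategy) together with a Monte Carlo estimate of the probability of correct selection. It is not the DP-greedy algorithm of the paper, whose details are not given here. The Gaussian rewards, the function names `epsilon_greedy` and `estimate_pcs`, and all parameter values are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(means, horizon, epsilon=0.1, sigma=1.0, seed=0):
    """Play a K-armed bandit with Gaussian rewards (an assumption) using
    a fixed-epsilon greedy rule.

    Returns (total_reward, pull_counts, sample_means).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once to initialise the estimates
        elif rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: uniformly random arm
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit: greedy arm
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total, counts, estimates


def estimate_pcs(means, pulls_per_arm, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of the probability of correct selection (PCS):
    the chance that the arm with the highest sample mean is truly the best arm
    after each arm has been pulled `pulls_per_arm` times.
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda a: means[a])
    hits = 0
    for _ in range(trials):
        sample_means = [
            sum(rng.gauss(m, sigma) for _ in range(pulls_per_arm)) / pulls_per_arm
            for m in means
        ]
        greedy = max(range(k), key=lambda a: sample_means[a])
        hits += int(greedy == best)
    return hits / trials


# Hypothetical three-armed problem; the means are illustrative only.
total, counts, estimates = epsilon_greedy([0.2, 0.5, 0.8], horizon=1000)
pcs = estimate_pcs([0.2, 0.5, 0.8], pulls_per_arm=20)
```

In this sketch the exploration rate `epsilon` is a fixed constant. The abstract indicates that the paper instead uses the expected reward of greedy actions, which builds on PCS, to decide how to balance exploration and exploitation.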