Q-Learning for Bandit Problems

  • Authors:
  • M. Duff

  • Affiliations:
  • -

  • Venue:
  • Proceedings of the Twelfth International Conference on Machine Learning (ICML)
  • Year:
  • 1995

Abstract

Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a task in state $i$ may be interpreted as a particular component of the maximum-value function associated with the "restart-in-$i$" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model. It suggests using task-state-specific Q-learning agents to solve their respective restart-in-state-$i$ subproblems, and it includes an example in which the on-line reinforcement-learning approach is applied to a simple stochastic-scheduling problem, one instance drawn from a wide class of problems that may be formulated as bandit problems.
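
Illustrative Sketch

The following Python sketch (not taken from the paper) illustrates the idea summarized in the abstract: Q-learning on the "restart-in-$i$" processes of a single arm, with every process updated in parallel from the same observed transition, and the Gittins index of state $i$ read off from the value of the restart-in-$i$ process at state $i$ per Katehakis & Veinott's characterization. The arm's transition matrix, rewards, discount factor, learning rate, and iteration count are illustrative assumptions, not values from the paper.

    import numpy as np

    # Illustrative sketch only: Q-learning on the "restart-in-i" processes of a
    # single bandit arm. The chain, rewards, and hyperparameters below are
    # hypothetical placeholders.
    rng = np.random.default_rng(0)

    n_states = 4
    gamma = 0.9         # discount factor (assumed)
    alpha = 0.05        # learning rate (assumed)
    P = np.array([[0.7, 0.3, 0.0, 0.0],   # hypothetical arm transition matrix
                  [0.2, 0.5, 0.3, 0.0],
                  [0.0, 0.2, 0.5, 0.3],
                  [0.0, 0.0, 0.4, 0.6]])
    r = np.array([1.0, 0.6, 0.3, 0.1])    # hypothetical state rewards

    # Q[i, j, a]: value of state j in the restart-in-i process, where
    # a = 0 means "continue from j" and a = 1 means "restart" (behave as in state i).
    Q = np.zeros((n_states, n_states, 2))

    j = 0
    for t in range(200_000):
        reward = r[j]
        j_next = rng.choice(n_states, p=P[j])   # one sampled arm transition

        # The sample (j, reward, j_next) is a valid "continue" transition from
        # state j in every restart-in-i process, so all of them update at once.
        for i in range(n_states):
            target = reward + gamma * Q[i, j_next].max()
            Q[i, j, 0] += alpha * (target - Q[i, j, 0])

        # In the restart-in-j process, the "restart" action from any state k
        # behaves exactly like continuing from state j, so the same sample also
        # updates Q[j, k, restart] for every k.
        for k in range(n_states):
            target = reward + gamma * Q[j, j_next].max()
            Q[j, k, 1] += alpha * (target - Q[j, k, 1])

        j = j_next

    # Gittins index of state i: (1 - gamma) times the value of the
    # restart-in-i process evaluated at state i (Katehakis & Veinott, 1987).
    indices = (1 - gamma) * Q[np.arange(n_states), np.arange(n_states)].max(axis=1)
    print("estimated Gittins indices:", np.round(indices, 3))

In a multi-armed setting, one such set of tables would be maintained per arm, and at each decision point the arm whose current state has the largest estimated index would be played, which is how the index estimates would drive a scheduling policy of the kind discussed in the paper's example.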