Q-Learning for Bandit Problems

  • Authors:
  • M. Duff

  • Affiliations:
  • -

  • Venue:
  • Proceedings of the Twelfth International Conference on Machine Learning (ICML)
  • Year:
  • 1995

Abstract

Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a task in state $i$ may be interpreted as a particular component of the maximum-value function associated with the "restart-in-$i$" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model. It suggests using task-state-specific Q-learning agents to solve their respective restart-in-state-$i$ subproblems, and it includes an example in which the on-line reinforcement-learning approach is applied to a simple stochastic-scheduling problem, one instance drawn from a wide class of problems that may be formulated as bandit problems.
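
Illustrative Sketch

The following Python sketch (not taken from the paper) illustrates the idea summarized in the abstract: Q-learning on the "restart-in-$i$" processes of a single arm, with every process updated in parallel from the same observed transition, and the Gittins index of state $i$ read off from the value of the restart-in-$i$ process at state $i$ per Katehakis & Veinott's characterization. The arm's transition matrix, rewards, discount factor, learning rate, and iteration count are illustrative assumptions, not values from the paper.

    import numpy as np

    # Illustrative sketch only: Q-learning on the "restart-in-i" processes of a
    # single bandit arm. The chain, rewards, and hyperparameters below are
    # hypothetical placeholders.
    rng = np.random.default_rng(0)

    n_states = 4
    gamma = 0.9         # discount factor (assumed)
    alpha = 0.05        # learning rate (assumed)
    P = np.array([[0.7, 0.3, 0.0, 0.0],   # hypothetical arm transition matrix
                  [0.2, 0.5, 0.3, 0.0],
                  [0.0, 0.2, 0.5, 0.3],
                  [0.0, 0.0, 0.4, 0.6]])
    r = np.array([1.0, 0.6, 0.3, 0.1])    # hypothetical state rewards

    # Q[i, j, a]: value of state j in the restart-in-i process, where
    # a = 0 means "continue from j" and a = 1 means "restart" (behave as in state i).
    Q = np.zeros((n_states, n_states, 2))

    j = 0
    for t in range(200_000):
        reward = r[j]
        j_next = rng.choice(n_states, p=P[j])   # one sampled arm transition

        # The sample (j, reward, j_next) is a valid "continue" transition from
        # state j in every restart-in-i process, so all of them update at once.
        for i in range(n_states):
            target = reward + gamma * Q[i, j_next].max()
            Q[i, j, 0] += alpha * (target - Q[i, j, 0])

        # In the restart-in-j process, the "restart" action from any state k
        # behaves exactly like continuing from state j, so the same sample also
        # updates Q[j, k, restart] for every k.
        for k in range(n_states):
            target = reward + gamma * Q[j, j_next].max()
            Q[j, k, 1] += alpha * (target - Q[j, k, 1])

        j = j_next

    # Gittins index of state i: (1 - gamma) times the value of the
    # restart-in-i process evaluated at state i (Katehakis & Veinott, 1987).
    indices = (1 - gamma) * Q[np.arange(n_states), np.arange(n_states)].max(axis=1)
    print("estimated Gittins indices:", np.round(indices, 3))

In a multi-armed setting, one such set of tables would be maintained per arm, and at each decision point the arm whose current state has the largest estimated index would be played, which is how the index estimates would drive a scheduling policy of the kind discussed in the paper's example.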