Computing an index policy for multiarmed bandits with deadlines

  • Authors: José Niño-Mora
  • Affiliations: Universidad Carlos III de Madrid, Leganés (Madrid), Spain
  • Venue: Proceedings of the 3rd International Conference on Performance Evaluation Methodologies and Tools
  • Year: 2008

Abstract

This paper introduces the multiarmed bandit problem with deadlines, which concerns the dynamic selection of a live project to engage out of a portfolio of Markovian bandit projects expiring after given deadlines, to maximize the expected total discounted or undiscounted reward earned. Although the problem is computationally intractable, a natural heuristic policy is obtained by attaching to each project the finite-horizon counterpart of its Gittins index, and then engaging at each time a live project of highest index. Remarkably, while such a finite-horizon index was introduced in [R. N. Bradt, S. M. Johnson, and S. Karlin (1956). On sequential designs to maximize the sum of n observations. Ann. Math. Statist. 27 1060--1074], an exact polynomial-time algorithm using arithmetic operations does not seem to have been proposed until [J. Niño-Mora (2005). A marginal productivity index policy for the finite-horizon multiarmed bandit problem. In Proceedings of CDC-ECC '05, pp. 1718--1722, IEEE]. Yet, such an adaptive-greedy index algorithm, which draws on methods introduced by the author for restless bandit indexation, has a complexity of O(T^3 n^3) operations for a T-horizon n-state project, rendering it impractical for all but small instances. This paper significantly improves on the complexity of such an algorithm, decoupling it into a recursive T-stage method that performs O(T^2 n^3) arithmetic operations. Moreover, in an insightful special model the complexity is further reduced to O(T^2) operations, and closed-form index formulae are given. Computational experiments are reported demonstrating the algorithm's runtime performance, and showing that the proposed index policy is near optimal and can substantially outperform the benchmark greedy and Gittins index policies.
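
To make the index policy described in the abstract concrete, the following is a minimal Python sketch. It is not the paper's adaptive-greedy or recursive T-stage algorithm: it approximates the finite-horizon index numerically by the standard calibration idea (find the per-period charge at which engaging the project once and then continuing optimally is worth exactly zero), using backward induction and bisection. The function names, the dictionary layout for projects, and the tolerance are all assumptions made for illustration.

```python
import numpy as np

def finite_horizon_index(P, r, beta, state, horizon, tol=1e-10):
    """Approximate the finite-horizon Gittins-type index of `state` when
    `horizon` >= 1 periods remain (calibration sketch, not the paper's method).
    P: n x n transition matrix, r: length-n reward vector, beta: discount factor."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)

    def net_value_of_engaging(lam):
        # v[j]: optimal value with s periods left under per-period charge lam,
        # where stopping (worth 0) is allowed at any time; starts at s = 0.
        v = np.zeros(len(r))
        for _ in range(horizon - 1):
            v = np.maximum(0.0, r - lam + beta * P @ v)
        # Value of engaging now (forced) and behaving optimally afterwards.
        return r[state] - lam + beta * P[state] @ v

    # The net value is continuous and strictly decreasing in lam, nonnegative
    # at min(r) and nonpositive at max(r), so bisection locates its root.
    lo, hi = r.min(), r.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_value_of_engaging(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def choose_project(projects, beta):
    """Index policy: engage a live project (remaining > 0) of highest index.
    Each project is a dict with hypothetical keys 'P', 'r', 'state', 'remaining'."""
    live = [k for k, p in enumerate(projects) if p["remaining"] > 0]
    if not live:
        return None
    return max(live, key=lambda k: finite_horizon_index(
        projects[k]["P"], projects[k]["r"], beta,
        projects[k]["state"], projects[k]["remaining"]))

# Project A pays 0 now but moves to a state paying 1; project B always pays 0.4.
project_a = {"P": [[0.0, 1.0], [0.0, 1.0]], "r": [0.0, 1.0], "state": 0, "remaining": 2}
project_b = {"P": [[1.0]], "r": [0.4], "state": 0, "remaining": 5}
chosen = choose_project([project_a, project_b], beta=1.0)
```

In this toy instance a greedy policy would engage project B (immediate reward 0.4 versus 0), while the index policy engages project A, whose two-period index is (0 + 1)/2 = 0.5. Each index evaluation here costs O(T n^2) operations per bisection step and returns only an approximation, in contrast to the exact arithmetic-operation counts quoted in the abstract.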