Computing an index policy for multiarmed bandits with deadlines

Authors:
José Niño-Mora
Affiliations:
Universidad Carlos III de Madrid, Leganés (Madrid), Spain
Venue:
Proceedings of the 3rd International Conference on Performance Evaluation Methodologies and Tools
Year:
2008

Citing 5
Cited 2

Dynamic Assortment with Demand Learning for Seasonal Consumer Goods

Management Science
Restless Bandit Marginal Productivity Indices, Diminishing Returns, and Optimal Control of Make-to-Order/Make-to-Stock M/G/1 Queues

Mathematics of Operations Research
Characterization and computation of restless bandit marginal productivity indices

Proceedings of the 2nd international conference on Performance evaluation methodologies and tools
A (2/3)n^3 Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain

INFORMS Journal on Computing
A Faster Index Algorithm and a Computational Study for Bandits with Switching Costs

INFORMS Journal on Computing

Computing a Classic Index for Finite-Horizon Bandits

INFORMS Journal on Computing

Abstract

This paper introduces the multiarmed bandit problem with deadlines, which concerns the dynamic selection of a live project to engage out of a portfolio of Markovian bandit projects expiring after given deadlines, to maximize the expected total discounted or undiscounted reward earned. Although the problem is computationally intractable, a natural heuristic policy is obtained by attaching to each project the finite-horizon counterpart of its Gittins index, and then engaging at each time a live project of highest index. Remarkably, while such a finite-horizon index was introduced in [R. N. Bradt, S. M. Johnson, and S. Karlin (1956). On sequential designs to maximize the sum of n observations. Ann. Math. Statist. 27 1060--1074], an exact polynomial-time algorithm using arithmetic operations does not seem to have been proposed until [J. Niño-Mora (2005). A marginal productivity index policy for the finite-horizon multiarmed bandit problem. In Proceedings of CDC-ECC '05, pp. 1718--1722, IEEE]. Yet, such an adaptive-greedy index algorithm, which draws on methods introduced by the author for restless bandit indexation, has a complexity of O(T^3 n^3) operations for a T-horizon n-state project, rendering it impractical for all but small instances. This paper significantly improves on the complexity of such an algorithm, decoupling it into a recursive T-stage method that performs O(T^2 n^3) arithmetic operations. Moreover, in an insightful special model the complexity is further reduced to O(T^2) operations, and closed-form index formulae are given. Computational experiments are reported demonstrating the algorithm's runtime performance, and showing that the proposed index policy is near optimal and can substantially outperform the benchmark greedy and Gittins index policies.
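The index policy described in the abstract can be illustrated with a small numerical sketch. This is not the adaptive-greedy algorithm or the recursive T-stage method of the paper. It assumes the standard calibration characterization of the finite-horizon index: the index of state i with t periods remaining is the per-period charge at which engaging the project for at least one period, and then stopping optimally within the remaining t - 1 periods, just breaks even. The charge is located by bisection, which is approximate and far slower than an exact algorithm. The names finite_horizon_index and choose_project, and the inputs P (transition matrix), R (state rewards), and beta (discount factor), are hypothetical and chosen for illustration only.

```python
import numpy as np


def finite_horizon_index(P, R, beta, T, tol=1e-10):
    """Approximate finite-horizon index table by calibration and bisection.

    Returns idx with idx[t - 1, i] = index of state i when t periods remain,
    for t = 1..T. Illustrative sketch only, not the paper's algorithm.
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(R)
    idx = np.empty((T, n))

    def continuation_gain(lam, t, i):
        # Net value of engaging now in state i at charge lam per period,
        # then stopping optimally within the remaining t - 1 periods.
        V = np.zeros(n)
        for _ in range(t - 1):
            V = np.maximum(0.0, R - lam + beta * (P @ V))
        return R[i] - lam + beta * (P[i] @ V)

    for t in range(1, T + 1):
        for i in range(n):
            # The index is a weighted average of rewards, so it lies
            # between the smallest and the largest one-period reward.
            lo, hi = R.min(), R.max()
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if continuation_gain(mid, t, i) > 0:
                    lo = mid  # charge too low: engaging is still profitable
                else:
                    hi = mid
            idx[t - 1, i] = 0.5 * (lo + hi)
    return idx


def choose_project(index_tables, states, remaining):
    """Pick a live project (remaining > 0) of highest current index."""
    live = [k for k, r in enumerate(remaining) if r > 0]
    if not live:
        return None
    return max(live, key=lambda k: index_tables[k][remaining[k] - 1, states[k]])


# Toy project: state 0 pays 0 and moves to state 1, which pays 1 forever.
P = [[0.0, 1.0], [0.0, 1.0]]
R = [0.0, 1.0]
table = finite_horizon_index(P, R, beta=1.0, T=3)
```

In the undiscounted toy project, state 0 has index 0 with one period remaining, 1/2 with two, and 2/3 with three, since a longer remaining horizon leaves more time to collect the reward of state 1. With a deadline, choose_project would be called at each time with every project's current state and number of periods left before it expires.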