Approximation algorithms for restless bandit problems

  • Authors:
  • Sudipto Guha; Kamesh Munagala; Peng Shi

  • Affiliations:
  • University of Pennsylvania, Philadelphia, PA; Duke University, Durham, NC; Duke University, Durham, NC

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2010

Abstract

The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. In this article, we consider the Feedback MAB problem, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite-horizon time-average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the Feedback MAB problem does not admit greedy index-based optimal policies. We develop a novel duality-based algorithmic technique that yields a surprisingly simple and intuitive (2+ε)-approximate greedy policy for this problem. We show that, both in terms of approximation factor and computational efficiency, our policy is closely related to the Whittle index, which is widely used for its simplicity and efficiency of computation. Subsequently, we define a multi-state generalization, which we term Monotone bandits, that remains a subclass of the restless bandit problem. We show that our policy remains a 2-approximation in this setting, and further, our technique is robust enough to incorporate various side constraints such as blocking plays, switching costs, and even models where determining the state of an arm is a separate operation from playing it. Our technique is also of independent interest for other restless bandit problems, and we provide an example in nonpreemptive machine replenishment. Interestingly, in this case, our policy provides a constant-factor guarantee, whereas the Whittle index is provably polynomially worse. By presenting the first O(1) approximations for nontrivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts.
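
As a rough illustration of the Feedback MAB setup described in the abstract, the following sketch simulates n arms whose hidden on/off states evolve as independent two-state Markov chains, with the exact state revealed only when an arm is played. The class and parameter names (Arm, alpha, beta, reward) are illustrative assumptions, and the policy shown is only a simple myopic greedy baseline on belief states, not the paper's (2+ε)-approximate policy or the Whittle index.

```python
import random

class Arm:
    """One arm of a Feedback MAB instance: a hidden two-state (on/off) Markov chain."""
    def __init__(self, alpha, beta, reward):
        self.alpha = alpha                  # P(off -> on) per step
        self.beta = beta                    # P(on -> off) per step
        self.reward = reward                # reward collected when played while "on"
        self.state = random.random() < 0.5  # hidden state, unknown to the policy
        self.belief = 0.5                   # P(state == on) given past observations

    def evolve(self):
        # Hidden state transitions every step, whether or not the arm is played.
        if self.state:
            self.state = random.random() >= self.beta
        else:
            self.state = random.random() < self.alpha
        # The belief transitions the same way, in expectation.
        self.belief = self.belief * (1 - self.beta) + (1 - self.belief) * self.alpha

def simulate(arms, horizon=100_000):
    """Play one arm per step with a myopic greedy rule; return time-average reward."""
    total = 0.0
    for _ in range(horizon):
        # Myopic choice: highest expected immediate reward given current beliefs.
        arm = max(arms, key=lambda a: a.belief * a.reward)
        # Playing reveals the exact state and collects reward if the arm is "on".
        if arm.state:
            total += arm.reward
        arm.belief = 1.0 if arm.state else 0.0
        # All arms, played or not, evolve according to their Markov chains.
        for a in arms:
            a.evolve()
    return total / horizon

if __name__ == "__main__":
    arms = [Arm(0.1, 0.2, 1.0), Arm(0.05, 0.5, 3.0), Arm(0.3, 0.3, 0.5)]
    print("time-average reward:", simulate(arms))
```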