The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. In this article, we consider the Feedback MAB problem, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite-horizon time-average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the Feedback MAB problem does not admit greedy index-based optimal policies. We develop a novel duality-based algorithmic technique that yields a surprisingly simple and intuitive (2+&epsilon;)-approximate greedy policy for this problem. We show that, both in terms of approximation factor and computational efficiency, our policy is closely related to the Whittle index, which is widely used for its simplicity and efficiency of computation. Subsequently, we define a multi-state generalization, which we term Monotone bandits, that remains a subclass of the restless bandit problem. We show that our policy remains a 2-approximation in this setting, and further, our technique is robust enough to incorporate various side constraints such as blocking plays, switching costs, and even models where determining the state of an arm is a separate operation from playing it. Our technique is also of independent interest for other restless bandit problems, and we provide an example in nonpreemptive machine replenishment.
Interestingly, in this case, our policy provides a constant factor guarantee, whereas the Whittle index is provably polynomially worse. By presenting the first O(1) approximations for nontrivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts.
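To make the Feedback MAB setting concrete, the following is a minimal simulation sketch. The two-state transition probabilities and the arm parameters are hypothetical, and the policy shown is a simple belief-greedy heuristic used only to illustrate the partial-observability structure; it is not the (2+&epsilon;)-approximate policy developed in the article.

```python
import random

# Sketch of the Feedback MAB model (hypothetical parameters).
# Each arm is an independent two-state ("on"/"off") Markov chain whose true
# state is observed only when the arm is played; a played "on" arm yields
# reward 1. Between plays we track each arm's belief state: the probability
# that it is currently "on", propagated through the transition matrix.

class Arm:
    def __init__(self, p_on_stay, p_off_to_on, seed=None):
        self.p_on_stay = p_on_stay      # P(on -> on)
        self.p_off_to_on = p_off_to_on  # P(off -> on)
        self.rng = random.Random(seed)
        self.state = 1                  # true state: 1 = "on", 0 = "off"
        self.belief = 1.0               # P(state == "on") given observations

    def evolve(self):
        # Advance the hidden Markov chain one step and update the belief.
        p = self.p_on_stay if self.state else self.p_off_to_on
        self.state = 1 if self.rng.random() < p else 0
        self.belief = (self.belief * self.p_on_stay
                       + (1 - self.belief) * self.p_off_to_on)

    def play(self):
        # Playing reveals the true state and collects its reward.
        reward = self.state
        self.belief = float(self.state)
        return reward

def simulate(arms, horizon):
    # Belief-greedy heuristic: at each step play the arm most likely to be
    # "on", then let every chain evolve. Returns the time-average reward.
    total = 0
    for _ in range(horizon):
        best = max(arms, key=lambda a: a.belief)
        total += best.play()
        for a in arms:
            a.evolve()
    return total / horizon

arms = [Arm(0.9, 0.1, seed=0), Arm(0.6, 0.4, seed=1)]
avg_reward = simulate(arms, 10000)
```

The key difficulty the abstract alludes to is visible here: the belief of an unplayed arm drifts toward its chain's stationary distribution, so the planner must trade off exploiting arms believed to be "on" against refreshing stale information about the others.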