Approximation algorithms for restless bandit problems
SODA '09: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
We consider a variant of the classic multi-armed bandit problem (MAB), which we call FEEDBACK MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. Each Markov chain evolves whether or not its arm is played, and the exact state of a chain is revealed to the player only when that arm is played and its reward observed. At most one arm (or in general, M arms) can be played in any time step. The goal is to design a policy for playing the arms that maximizes the infinite-horizon time-average expected reward. This problem is an instance of a Partially Observable Markov Decision Process (POMDP), and a special case of the notoriously intractable "restless bandit" problem. Unlike the stochastic MAB problem, the FEEDBACK MAB problem does not admit greedy index-based optimal policies. The state of the system at any time step encodes the beliefs about the states of the different arms, and the policy's decisions change these beliefs; this aspect complicates the design and analysis of simple algorithms.

We design a constant-factor approximation algorithm for the FEEDBACK MAB problem by solving and rounding a natural LP relaxation of the problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.
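The belief dynamics the abstract describes can be made concrete with a small sketch. The snippet below is illustrative only: it simulates n arms as two-state (on/off) Markov chains with assumed transition probabilities `p01` (off to on) and `p10` (on to off), propagates the player's belief about each unobserved chain, and plays a simple myopic policy (always play the arm with the highest belief). The myopic rule is exactly the kind of greedy index policy that, per the abstract, is not optimal for FEEDBACK MAB; it is shown here only to make the state/belief bookkeeping concrete, and is not the paper's LP-rounding algorithm.

```python
import random

def propagate(b, p01, p10):
    """One-step belief update for an unobserved two-state chain:
    P(on next step) = P(on)*(1 - p10) + P(off)*p01."""
    return b * (1.0 - p10) + (1.0 - b) * p01

def simulate_myopic(arms, horizon, seed=0):
    """Simulate the myopic policy on FEEDBACK-MAB-style dynamics.
    `arms` is a list of (p01, p10) pairs (hypothetical parameters);
    reward is 1 per step if the played arm is 'on'."""
    rng = random.Random(seed)
    states = [rng.random() < 0.5 for _ in arms]
    # Start each belief at the chain's stationary probability of 'on'.
    beliefs = [p01 / (p01 + p10) for p01, p10 in arms]
    total = 0
    for _ in range(horizon):
        # Myopic choice: play the arm most likely to be 'on'.
        i = max(range(len(arms)), key=lambda j: beliefs[j])
        total += 1 if states[i] else 0
        # Playing reveals the exact state, so that belief resets to 0 or 1.
        beliefs[i] = 1.0 if states[i] else 0.0
        # Every chain evolves regardless of whether its arm was played,
        # and unplayed arms' beliefs are only propagated, never observed.
        for j, (p01, p10) in enumerate(arms):
            states[j] = (rng.random() < 1.0 - p10) if states[j] else (rng.random() < p01)
            beliefs[j] = propagate(beliefs[j], p01, p10)
    return total / horizon
```

Note how the per-arm belief, not the hidden state, is what the policy conditions on; between plays a belief relaxes toward the chain's stationary probability, which is the coupling between decisions and information that makes the problem a POMDP.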