Partially Observed Markov Decision Process Multiarmed Bandits---Structural Results

  • Authors: Vikram Krishnamurthy; Bo Wahlberg
  • Affiliations: Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Automatic Control and ACCESS, School of Electrical Engineering, KTH, SE-100 44 Stockholm, Sweden
  • Venue: Mathematics of Operations Research
  • Year: 2009

Abstract

This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that when the transition probability matrices are totally positive of order 2 (TP2) and the observation probabilities are monotone likelihood ratio (MLR) ordered, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
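The structural assumptions named in the abstract are concrete, checkable properties. As a minimal illustrative sketch (not code from the paper; function names are assumptions), a transition matrix is TP2 if every 2x2 minor is nonnegative, and two belief (information-state) vectors are MLR ordered if their componentwise likelihood ratio is monotone:

```python
import numpy as np

def is_tp2(P, tol=1e-12):
    """P is totally positive of order 2 (TP2) if every 2x2 minor
    is nonnegative: P[i,j]*P[k,l] - P[i,l]*P[k,j] >= 0 for i<k, j<l."""
    n, m = P.shape
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(m):
                for l in range(j + 1, m):
                    if P[i, j] * P[k, l] < P[i, l] * P[k, j] - tol:
                        return False
    return True

def mlr_leq(pi1, pi2, tol=1e-12):
    """pi1 precedes pi2 in the MLR order if the ratio pi2/pi1 is
    increasing, i.e. pi1[i]*pi2[j] >= pi1[j]*pi2[i] for all i < j."""
    n = len(pi1)
    for i in range(n):
        for j in range(i + 1, n):
            if pi1[i] * pi2[j] < pi1[j] * pi2[i] - tol:
                return False
    return True

# Example: a diagonally dominant transition matrix is TP2,
# and [0.3, 0.7] MLR-dominates [0.6, 0.4].
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(is_tp2(P))                                        # True
print(mlr_leq(np.array([0.6, 0.4]), np.array([0.3, 0.7])))  # True
```

Under these two conditions the paper's monotonicity result applies: the Gittins index is MLR increasing in the information state, which is what the structured algorithms exploit.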