Partially Observed Markov Decision Process Multiarmed Bandits---Structural Results

  • Authors: Vikram Krishnamurthy; Bo Wahlberg
  • Affiliations: Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Automatic Control and ACCESS, School of Electrical Engineering, KTH, SE-100 44 Stockholm, Sweden
  • Venue: Mathematics of Operations Research
  • Year: 2009

Abstract

This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that when the transition probability matrices are totally positive of order 2 (TP2) and the observation probabilities are monotone likelihood ratio (MLR) ordered, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
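The structural assumptions named in the abstract are concrete, checkable properties. As a minimal illustrative sketch (not code from the paper; function names are assumptions), a transition matrix is TP2 if every 2x2 minor is nonnegative, and two belief (information-state) vectors are MLR ordered if their componentwise likelihood ratio is monotone:

```python
import numpy as np

def is_tp2(P, tol=1e-12):
    """P is totally positive of order 2 (TP2) if every 2x2 minor
    is nonnegative: P[i,j]*P[k,l] - P[i,l]*P[k,j] >= 0 for i<k, j<l."""
    n, m = P.shape
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(m):
                for l in range(j + 1, m):
                    if P[i, j] * P[k, l] < P[i, l] * P[k, j] - tol:
                        return False
    return True

def mlr_leq(pi1, pi2, tol=1e-12):
    """pi1 precedes pi2 in the MLR order if the ratio pi2/pi1 is
    increasing, i.e. pi1[i]*pi2[j] >= pi1[j]*pi2[i] for all i < j."""
    n = len(pi1)
    for i in range(n):
        for j in range(i + 1, n):
            if pi1[i] * pi2[j] < pi1[j] * pi2[i] - tol:
                return False
    return True

# Example: a diagonally dominant transition matrix is TP2,
# and [0.3, 0.7] MLR-dominates [0.6, 0.4].
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(is_tp2(P))                                        # True
print(mlr_leq(np.array([0.6, 0.4]), np.array([0.3, 0.7])))  # True
```

Under these two conditions the paper's monotonicity result applies: the Gittins index is MLR increasing in the information state, which is what the structured algorithms exploit.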