Point-based policy iteration

  • Authors:
  • Shihao Ji, Ronald Parr, Hui Li, Xuejun Liao, Lawrence Carin

  • Affiliations:
  • Department of Electrical and Computer Engineering, Duke University, Durham, NC (Ji, Li, Liao, Carin); Department of Computer Science and Department of Electrical and Computer Engineering, Duke University, Durham, NC (Parr)

  • Venue:
  • AAAI'07: Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2
  • Year:
  • 2007

Abstract

We describe a point-based policy iteration (PBPI) algorithm for infinite-horizon POMDPs. PBPI replaces the exact policy improvement step of Hansen's policy iteration with point-based value iteration (PBVI). Despite being an approximate algorithm, PBPI is monotonic: at each iteration before convergence, PBPI produces a policy whose value increases for at least one of a finite set of initial belief states and decreases for none of them. In contrast, PBVI cannot guarantee monotonic improvement of the value function or the policy. In practice, PBPI generally requires a lower density of belief-point coverage of the simplex and tends to produce superior policies with less computation. Experiments on several benchmark problems (up to 12,545 states) demonstrate the scalability and robustness of the PBPI algorithm.
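
To make the improvement step concrete, the sketch below implements a PBVI-style point-based Bellman backup over a fixed set of belief points on the classic Tiger POMDP. This is only an illustration of the backup that PBPI borrows from PBVI for policy improvement, not the authors' full algorithm: PBPI additionally evaluates the resulting finite-state-controller policy exactly between improvement steps, which is omitted here. The model parameters (0.85 listening accuracy, a 0.95 discount, the chosen belief grid) are standard textbook assumptions, not values taken from the paper.

```python
# Point-based (PBVI-style) backups on the classic Tiger POMDP.
# Sketch of the improvement step only; PBPI's exact policy-evaluation step is omitted.
import numpy as np

S, A, O = 2, 3, 2          # states: tiger-left/right; actions: listen, open-left, open-right
gamma = 0.95               # assumed discount factor

# T[a, s, s'] : transition probabilities
T = np.zeros((A, S, S))
T[0] = np.eye(S)           # listen: state unchanged
T[1] = T[2] = 0.5          # opening a door resets the problem to a uniform belief

# Z[a, s', o] : observation probabilities
Z = np.zeros((A, S, O))
Z[0] = [[0.85, 0.15], [0.15, 0.85]]   # listen: hear the correct side with prob 0.85
Z[1] = Z[2] = 0.5                      # after opening, observations are uninformative

# R[a, s] : expected immediate reward
R = np.array([[-1.0, -1.0],            # listen
              [-100.0, 10.0],          # open-left
              [10.0, -100.0]])         # open-right

def point_based_backup(b, alphas):
    """One point-based Bellman backup at belief b; returns the best new alpha-vector."""
    best_val, best_alpha = -np.inf, None
    for a in range(A):
        g_a = R[a].copy()
        for o in range(O):
            # g[s] = sum_{s'} T(s'|s,a) Z(o|s',a) alpha(s'), for every current alpha
            projections = np.array([T[a] @ (Z[a][:, o] * alpha) for alpha in alphas])
            # keep the projection that is best at this belief point
            g_a += gamma * projections[np.argmax(projections @ b)]
        if g_a @ b > best_val:
            best_val, best_alpha = g_a @ b, g_a
    return best_alpha

# Finite set of belief points (probability that the tiger is behind the left door)
beliefs = [np.array([p, 1.0 - p]) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]

alphas = [np.zeros(S)]                 # start from the zero value function
for _ in range(60):
    alphas = [point_based_backup(b, alphas) for b in beliefs]

for b in beliefs:
    print(f"b(tiger-left)={b[0]:.2f}  value={max(a @ b for a in alphas):.2f}")
```

In PBPI, backups of this kind are used only to improve the controller at the chosen belief points, while the value of the improved policy is then computed exactly; that combination is what yields the monotonicity property stated in the abstract, which plain PBVI iteration as sketched above does not guarantee.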