The most successful recent approaches to solving partially observable Markov decision processes (POMDPs) have largely been point-based approximation algorithms. These work by selecting a finite set of belief points, computing alpha-vectors for those points, and using the resulting policy everywhere. However, if during execution the belief state is far from the selected points, there is no guarantee that the policy will be good. This occurs either when the points are chosen poorly or when there are too few points to capture the whole optimal policy, for example in domains with many low-probability transitions such as faults or exogenous events. In this paper we explore an on-line plan repair approach to overcome this difficulty. The idea is to split computation between off-line plan creation and, when necessary, on-line plan repair. We evaluate a variety of heuristics for deciding when plan repair might be useful, and then repair the plan by sampling a small number of additional belief points and recomputing the policy. We show in several domains that this approach is more effective than either off-line planning alone, even when the latter is given much more computation time, or a purely on-line planner based on forward search. We also show that the overhead of checking the heuristics is very small when replanning is unnecessary.