Using rewards for belief state updates in partially observable markov decision processes

Authors:
Masoumeh T. Izadi;Doina Precup
Affiliations:
School of Computer Science, McGill University, Montreal, QC, Canada;School of Computer Science, McGill University, Montreal, QC, Canada
Venue:
ECML'05 Proceedings of the 16th European conference on Machine Learning
Year:
2005

Citing 7
Cited 0

An epsilon-Optimal Grid-Based Algorithm for Partially Observable Markov Decision Processes

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Equivalence notions and model minimization in Markov decision processes

Artificial Intelligence - special issue on planning with uncertainty and incomplete information
Heuristic search value iteration for POMDPs

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Value-function approximations for partially observable Markov decision processes

Journal of Artificial Intelligence Research
Point-based value iteration: an anytime algorithm for POMDPs

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Using core beliefs for point-based value iteration

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes

UAI'97 Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

Partially Observable Markov Decision Processes (POMDP) provide a standard framework for sequential decision making in stochastic environments. In this setting, an agent takes actions and receives observations and rewards from the environment. Many POMDP solution methods are based on computing a belief state, which is a probability distribution over possible states in which the agent could be. The action choice of the agent is then based on the belief state. The belief state is computed based on a model of the environment, and the history of actions and observations seen by the agent. However, reward information is not taken into account in updating the belief state. In this paper, we argue that rewards can carry useful information that can help disambiguate the hidden state. We present a method for updating the belief state which takes rewards into account. We present experiments with exact and approximate planning methods on several standard POMDP domains, using this belief update method, and show that it can provide advantages, both in terms of speed and in terms of the quality of the solution obtained.