Partially Observable Markov Decision Processes (POMDPs) have been widely studied as a model for decision making under uncertainty, and a number of methods have been developed to solve them. Such methods often require computing the value function of a specific policy, given a model of the transition and observation probabilities and the reward function. These models can be learned from labeled samples of on-policy trajectories. However, imperfect empirical models introduce bias and variance into the resulting value function estimates. In this paper, we propose a method for estimating the bias and variance of the value function in terms of the statistics of the empirical transition and observation models. These error terms make it possible to compare the values of different policies in a meaningful way. This is an important result for sequential decision making, since it allows more formal guarantees about the quality of the policies we implement. To evaluate the accuracy of the proposed method, we present supporting experiments on problems from robotics and medical decision making.
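To make the quantities concrete, the sketch below measures, by Monte Carlo simulation, the bias and variance that an empirical transition model induces in a policy's value estimate. This is not the paper's method (which derives these terms analytically for POMDPs from the model statistics); it is a minimal illustration on a toy fully observable MDP, and all sizes, parameters, and variable names here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's POMDP derivation):
# empirically measure the bias and variance of a policy's value estimate
# when the value is computed from a maximum-likelihood transition model
# learned from a finite number of sampled transitions.

rng = np.random.default_rng(0)
n_states, gamma = 4, 0.9        # toy MDP size and discount (assumptions)
n_samples, n_trials = 50, 2000  # transitions per state, resampling trials

# True dynamics under a fixed policy: row-stochastic transition matrix P
# and per-state expected reward r.
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.uniform(0.0, 1.0, size=n_states)

def policy_value(P_hat):
    """Exact value of the fixed policy under model P_hat: V = (I - gamma P)^-1 r."""
    return np.linalg.solve(np.eye(n_states) - gamma * P_hat, r)

V_true = policy_value(P)

# Repeatedly learn an empirical model from n_samples transitions per state,
# then record the value estimate that model yields.
estimates = np.empty((n_trials, n_states))
for t in range(n_trials):
    counts = np.stack([rng.multinomial(n_samples, P[s]) for s in range(n_states)])
    P_emp = counts / n_samples  # maximum-likelihood transition model
    estimates[t] = policy_value(P_emp)

bias = estimates.mean(axis=0) - V_true  # systematic error from imperfect models
variance = estimates.var(axis=0)        # spread across resampled models

print("per-state bias:    ", np.round(bias, 4))
print("per-state variance:", np.round(variance, 4))
```

Note that the bias is generally nonzero even though the empirical transition model is itself unbiased, because the value function is a nonlinear function of the model; this is exactly why error terms of the kind proposed in the paper are needed to compare policies fairly.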