Partially Observable Markov Decision Processes (POMDPs) have been widely studied as a model for decision making under uncertainty, and a number of methods have been developed to solve them. Such methods often require computing the value function of a specific policy, given a model of the transition and observation probabilities and the reward function. These models can be learned from labeled samples of on-policy trajectories. However, imperfect empirical models introduce bias and variance into the resulting value function estimates. In this paper, we propose a method for estimating the bias and variance of the value function in terms of the statistics of the empirical transition and observation models. These error terms make it possible to compare the values of different policies in a meaningful way. This is an important result for sequential decision making, since it allows more formal guarantees about the quality of the policies we implement. To evaluate the accuracy of the proposed method, we present supporting experiments on problems from robotics and medical decision making.
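To make the quantities concrete, the sketch below measures, by Monte Carlo simulation, the bias and variance that an empirical transition model induces in a policy's value estimate. This is not the paper's method (which derives these terms analytically for POMDPs from the model statistics); it is a minimal illustration on a toy fully observable MDP, and all sizes, parameters, and variable names here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's POMDP derivation):
# empirically measure the bias and variance of a policy's value estimate
# when the value is computed from a maximum-likelihood transition model
# learned from a finite number of sampled transitions.

rng = np.random.default_rng(0)
n_states, gamma = 4, 0.9        # toy MDP size and discount (assumptions)
n_samples, n_trials = 50, 2000  # transitions per state, resampling trials

# True dynamics under a fixed policy: row-stochastic transition matrix P
# and per-state expected reward r.
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.uniform(0.0, 1.0, size=n_states)

def policy_value(P_hat):
    """Exact value of the fixed policy under model P_hat: V = (I - gamma P)^-1 r."""
    return np.linalg.solve(np.eye(n_states) - gamma * P_hat, r)

V_true = policy_value(P)

# Repeatedly learn an empirical model from n_samples transitions per state,
# then record the value estimate that model yields.
estimates = np.empty((n_trials, n_states))
for t in range(n_trials):
    counts = np.stack([rng.multinomial(n_samples, P[s]) for s in range(n_states)])
    P_emp = counts / n_samples  # maximum-likelihood transition model
    estimates[t] = policy_value(P_emp)

bias = estimates.mean(axis=0) - V_true  # systematic error from imperfect models
variance = estimates.var(axis=0)        # spread across resampled models

print("per-state bias:    ", np.round(bias, 4))
print("per-state variance:", np.round(variance, 4))
```

Note that the bias is generally nonzero even though the empirical transition model is itself unbiased, because the value function is a nonlinear function of the model; this is exactly why error terms of the kind proposed in the paper are needed to compare policies fairly.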