Acting optimally in partially observable stochastic domains
AAAI'94 Proceedings of the twelfth national conference on Artificial intelligence (vol. 2)
Learning low dimensional predictive representations
ICML '04 Proceedings of the twenty-first international conference on Machine learning
Learning predictive representations from a history
ICML '05 Proceedings of the 22nd international conference on Machine learning
Observable Operator Models for Discrete Stochastic Time Series
Neural Computation
Learning predictive state representations using non-blind policies
ICML '06 Proceedings of the 23rd international conference on Machine learning
Improving approximate value iteration using memories and predictive state representations
AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Point-based value iteration: an anytime algorithm for POMDPs
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Point-based planning for predictive state representations
Canadian AI'08 Proceedings of the Canadian Society for computational studies of intelligence, 21st conference on Advances in artificial intelligence
A Monte-Carlo AIXI approximation
Journal of Artificial Intelligence Research
Hi-index | 0.00 |
A central problem in artificial intelligence is to plan to maximize future reward under uncertainty in a partially observable environment. Models of such environments include Partially Observable Markov Decision Processes (POMDPs) [4] as well as their generalizations, Predictive State Representations (PSRs) [9] and Observable Operator Models (OOMs) [7]. POMDPs model the state of the world as a latent variable; in contrast, PSRs and OOMs represent state by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Unfortunately, exact planning algorithms such as value iteration [14] are intractable for most realistic POMDPs due to the curse of history and the curse of dimensionality [11]. However, PSRs and OOMs hold the promise of mitigating both of these curses: first, many successful approximate planning techniques designed to address these problems in POMDPs can easily be adapted to PSRs and OOMs [8, 6]. Second, PSRs and OOMs are often more compact than their corresponding POMDPs (i.e., need fewer state dimensions), mitigating the curse of dimensionality. Finally, since tests and histories are observable quantities, it has been suggested that PSRs and OOMs should be easier to learn than POMDPs; with a successful learning algorithm, we can look for a model which ignores all but the most important components of state, reducing dimensionality still further. In this paper we take an important step toward realizing the above hopes. In particular, we propose and demonstrate a fast and statistically consistent spectral algorithm which learns the parameters of a PSR directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Closing the loop is a much more stringent test than simply checking short-term prediction accuracy, since the quality of an optimized policy depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans.