Closing the learning-planning loop with predictive state representations

  • Authors:
  • Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA (all authors)

  • Venue:
  • Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), Volume 1
  • Year:
  • 2010

Abstract

A central problem in artificial intelligence is to plan to maximize future reward under uncertainty in a partially observable environment. Models of such environments include Partially Observable Markov Decision Processes (POMDPs) [4] as well as their generalizations, Predictive State Representations (PSRs) [9] and Observable Operator Models (OOMs) [7]. POMDPs model the state of the world as a latent variable; in contrast, PSRs and OOMs represent state by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Unfortunately, exact planning algorithms such as value iteration [14] are intractable for most realistic POMDPs due to the curse of history and the curse of dimensionality [11].

However, PSRs and OOMs hold the promise of mitigating both of these curses. First, many successful approximate planning techniques designed to address these problems in POMDPs can easily be adapted to PSRs and OOMs [8, 6]. Second, PSRs and OOMs are often more compact than their corresponding POMDPs (i.e., they need fewer state dimensions), mitigating the curse of dimensionality. Finally, since tests and histories are observable quantities, it has been suggested that PSRs and OOMs should be easier to learn than POMDPs; with a successful learning algorithm, we can look for a model which ignores all but the most important components of state, reducing dimensionality still further.

In this paper we take an important step toward realizing the above hopes. In particular, we propose and demonstrate a fast and statistically consistent spectral algorithm which learns the parameters of a PSR directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Closing the loop is a much more stringent test than simply checking short-term prediction accuracy, since the quality of an optimized policy depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans.
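
To make "tracking occurrence probabilities of tests" concrete, the transformed-PSR formulation common in the spectral-learning literature (the notation below is an assumption consistent with that line of work, not quoted from this abstract) keeps a low-dimensional state vector b_t of projected test predictions and updates it with one observable operator B_{ao} per action-observation pair. With U holding the top left singular vectors of the test-history probability matrix P_{T,H}, P_T and P_H the empirical test and history probability vectors, and (·)^{+} the Moore-Penrose pseudo-inverse:

    b_1 = U^\top P_T, \qquad
    b_\infty^\top = P_H^\top \left(U^\top P_{T,H}\right)^{+}, \qquad
    B_{ao} = U^\top P_{T,ao,H} \left(U^\top P_{T,H}\right)^{+}

    b_{t+1} = \frac{B_{a_t o_t}\, b_t}{b_\infty^\top B_{a_t o_t}\, b_t},
    \qquad
    \Pr\!\left(o \mid \operatorname{do}(a),\, h_t\right) = b_\infty^\top B_{ao}\, b_t

Here P_{T,ao,H} collects the probabilities of each test and history co-occurring with (a, o) executed and observed in between.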
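
The spectral learning step can likewise be sketched in a few lines of numpy: estimate the matrices above from action-observation data, compress with an SVD, and recover the parameters by pseudo-inversion. All names and the function signature below are hypothetical, the empirical matrices are assumed to be estimated elsewhere, and the paper's exact estimators and normalizations may differ; this is a minimal sketch of the general technique, not the authors' implementation.

    import numpy as np

    def learn_psr(P_T, P_H, P_TH, P_TaoH, k):
        """Spectral PSR learning: a minimal sketch (hypothetical names).

        P_T    : (nT,)  probabilities of each test from the initial condition
        P_H    : (nH,)  probabilities of each indicative event (history)
        P_TH   : (nT, nH) joint probabilities of tests and histories
        P_TaoH : dict (a, o) -> (nT, nH) matrix of test/history probabilities
                 with (a, o) executed and observed in between
        k      : number of state dimensions to keep
        """
        # Compress tests onto the top-k left singular vectors of P_TH.
        U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
        U = U[:, :k]
        # Pseudo-inverse "divides out" the history statistics.
        pinv = np.linalg.pinv(U.T @ P_TH)           # shape (nH, k)
        b1 = U.T @ P_T                               # initial predictive state
        binf = pinv.T @ P_H                          # normalizer: binf @ b_t
        B = {ao: U.T @ M @ pinv for ao, M in P_TaoH.items()}
        return b1, binf, B

    def filter_step(b, binf, B, a, o):
        """Condition the predictive state on executing a and observing o."""
        bn = B[(a, o)] @ b
        return bn / (binf @ bn)

    def predict_obs(b, binf, B, a, o):
        """Predicted probability of observing o if we execute a in state b."""
        return float(binf @ B[(a, o)] @ b)

In the closed-loop experiment the abstract describes, states maintained by a filter like filter_step above would feed an approximate planner adapted to PSRs [8, 6], and the resulting policy would then be executed in the original environment.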