Model-based least-squares policy evaluation

  • Authors:
  • Fletcher Lu; Dale Schuurmans

  • Affiliations:
  • School of Computer Science, University of Waterloo, Waterloo, ON, Canada (both authors)

  • Venue:
  • AI'03: Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence
  • Year:
  • 2003

Abstract

A popular form of policy evaluation for large Markov Decision Processes (MDPs) is the least-squares temporal difference (TD) method. Least-squares TD methods handle large MDPs by requiring prior knowledge in the form of feature vectors, which serve as basis vectors that compress the system to a tractable size. Model-based methods have largely been ignored in favour of model-free TD algorithms due to two perceived drawbacks: slower computation time and larger storage requirements. This paper challenges the perceived advantage of temporal difference methods over model-based methods in three distinct ways. First, it provides a new model-based approximate policy evaluation method that produces solutions faster than Boyan's least-squares TD method. Second, it introduces a new algorithm for deriving basis vectors without any prior knowledge of the system. Third, it introduces an iteratively improving model-based value estimator that can run faster than standard TD methods. All of the algorithms require model storage but remain competitive in accuracy with model-free temporal difference methods.
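
The abstract contrasts two families of evaluators: exact model-based policy evaluation, which solves the linear system V = R + γPV directly from the model, and least-squares TD, which estimates a compressed value function from sampled transitions through a set of basis vectors. The sketch below illustrates that contrast on a toy problem; the five-state chain, the random basis matrix Phi, and all parameters are illustrative assumptions, not the paper's construction, and the LSTD update follows the standard formulation rather than the authors' specific algorithms.

```python
import numpy as np

# --- Hypothetical toy MDP: a 5-state chain under a fixed policy ---
n_states, gamma = 5, 0.9
rng = np.random.default_rng(0)

# P[s, s'] is the transition matrix induced by the fixed policy;
# R[s] is the expected one-step reward. Both are illustrative.
P = rng.dirichlet(np.ones(n_states), size=n_states)
R = rng.uniform(0.0, 1.0, size=n_states)

# Model-based (exact) policy evaluation: solve (I - gamma * P) V = R.
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P, R)

# Least-squares TD (LSTD) from sampled transitions, using a small
# feature matrix Phi whose columns act as basis vectors.
k = 3
Phi = rng.standard_normal((n_states, k))

A = np.zeros((k, k))
b = np.zeros(k)
s = 0
for _ in range(20_000):
    s_next = rng.choice(n_states, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    # Accumulate the standard LSTD statistics:
    #   A += phi (phi - gamma * phi')^T,   b += phi * r
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * R[s]
    s = s_next

theta = np.linalg.solve(A, b)   # LSTD weight vector
V_lstd = Phi @ theta            # compressed value estimate

print("exact V:", np.round(V_exact, 3))
print("LSTD  V:", np.round(V_lstd, 3))
```

With enough samples and a well-conditioned basis, V_lstd approaches the solution of the projected Bellman equation within the span of Phi, while V_exact is available only when the model (P, R) is stored explicitly, which is the trade-off the abstract discusses.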