A semiparametric statistical approach to model-free policy evaluation

Authors:
Tsuyoshi Ueno;Motoaki Kawanabe;Takeshi Mori;Shin-ichi Maeda;Shin Ishii
Affiliations:
Kyoto University, Kyoto, Japan;Fraunhofer FIRST, IDA, Berlin, Germany;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan
Venue:
Proceedings of the 25th international conference on Machine learning
Year:
2008


Abstract

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) learning have been developed recently and have shown good practical performance. However, the quality of their estimates has not been well characterized. In this article, we discuss LSTD-based policy evaluation from the new viewpoint of semiparametric statistical inference. We show that the LSTD estimator can be obtained from a particular estimating function that guarantees asymptotic convergence to the true value without specifying a model of the environment. Based on this observation, we (1) analyze the asymptotic variance of an LSTD-based estimator, (2) derive the optimal estimating function with the minimum asymptotic estimation variance, and (3) derive a suboptimal estimator that reduces the computational burden of obtaining the optimal estimating function.
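
To make the estimating-function view concrete, the following is a minimal sketch of the textbook LSTD estimator for a linear value function V(s) ≈ φ(s)ᵀθ. It is not the authors' code and does not implement the optimal or suboptimal estimators from the paper. The function name `lstd`, the two-state deterministic chain, the one-hot features, and the discount factor 0.5 are all assumptions chosen for illustration. LSTD picks θ so that the sample estimating equation Σ_t φ_t (r_t + γ φ_{t+1}ᵀθ − φ_tᵀθ) = 0 holds, which requires only observed transitions and no model of the environment.

```python
import numpy as np


def lstd(features, next_features, rewards, gamma):
    """Solve sum_t phi_t * (r_t + gamma * phi_{t+1}^T theta - phi_t^T theta) = 0 for theta."""
    phi = np.asarray(features, dtype=float)            # shape (T, d)
    phi_next = np.asarray(next_features, dtype=float)  # shape (T, d)
    r = np.asarray(rewards, dtype=float)               # shape (T,)
    # A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T and b = sum_t phi_t r_t
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ r
    return np.linalg.solve(A, b)


# Hypothetical two-state chain: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
one_hot = np.eye(2)  # tabular (one-hot) features, an illustrative choice
states = [0, 1, 0, 1]
next_states = [1, 0, 1, 0]
rewards = [1.0, 0.0, 1.0, 0.0]
gamma = 0.5

theta = lstd(one_hot[states], one_hot[next_states], rewards, gamma)

# Value of the estimating function at the solution (zero up to rounding).
td_errors = np.asarray(rewards) + gamma * one_hot[next_states] @ theta - one_hot[states] @ theta
residual = one_hot[states].T @ td_errors
```

With one-hot features on this deterministic chain, the Bellman equations V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0) give V(0) = 4/3 and V(1) = 2/3, so `theta` should match these values up to floating-point error. On stochastic trajectories the same estimator has nonzero variance, which is the quantity the abstract proposes to analyze.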