Residual gradient (RG) was proposed as an alternative to TD(0) for policy evaluation when function approximation is used, but little formal analysis comparing the two exists outside of very limited cases. This paper employs techniques from the online learning of linear functions to provide a worst-case (non-probabilistic) comparison of these two types of algorithms under linear function approximation. No statistical assumptions are made about the sequence of observations, so the analysis applies even to non-Markovian and adversarial domains. In particular, our results suggest that RG tends to produce smaller temporal differences, while TD(0) is more likely to yield smaller prediction errors. Both phenomena can be observed in two simple, non-adversarial Markov chain examples.
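To make the contrast concrete, below is a minimal sketch (not from the paper) of the two update rules for a linear value function V(s) = θᵀφ(s). The function names, step size, and toy chain are illustrative assumptions, not the paper's own code or its Markov chain examples.

```python
import numpy as np

def td0_update(theta, phi_s, phi_next, reward, gamma, alpha):
    # TD(0): semi-gradient step along the current state's features,
    # scaled by the temporal difference delta.
    delta = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    return theta + alpha * delta * phi_s

def rg_update(theta, phi_s, phi_next, reward, gamma, alpha):
    # Residual gradient: exact gradient descent on 0.5 * delta**2,
    # so the step also moves along the next state's features.
    delta = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    return theta - alpha * delta * (gamma * phi_next - phi_s)

if __name__ == "__main__":
    # Toy usage on an illustrative two-state chain with tabular features.
    rng = np.random.default_rng(0)
    phi = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
    theta_td = np.zeros(2)
    theta_rg = np.zeros(2)
    s = 0
    for _ in range(1000):
        s_next = int(rng.integers(2))     # uniformly random transitions
        r = 1.0 if s_next == 1 else 0.0   # reward for entering state 1
        theta_td = td0_update(theta_td, phi[s], phi[s_next], r, 0.9, 0.1)
        theta_rg = rg_update(theta_rg, phi[s], phi[s_next], r, 0.9, 0.1)
        s = s_next
    print("TD(0) values:", theta_td)
    print("RG values:   ", theta_rg)
```

The only difference between the two rules is the direction of the step: TD(0) moves along φ(s) alone, while RG follows the full gradient of the squared temporal difference, which also involves γφ(s′). This asymmetry is consistent with the trade-off suggested above, with RG shrinking temporal differences and TD(0) tracking the values themselves more closely.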