TD(λ) Converges with Probability 1. Machine Learning.
A Counterexample to Temporal Differences Learning. Neural Computation.
Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning (special issue on reinforcement learning).
Matrix Computations (3rd ed.).
Dynamic Programming and Optimal Control.
Dynamic Programming and Optimal Control.
Neuro-Dynamic Programming.
Gradient Convergence in Gradient Methods with Errors. SIAM Journal on Optimization.
Technical Update: Least-Squares Temporal Difference Learning. Machine Learning.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Least-Squares Policy Iteration. The Journal of Machine Learning Research.
Performance Loss Bounds for Approximate Value Iteration with State Aggregation. Mathematics of Operations Research.
Dynamic Modeling and Control of Supply Chain Systems: A Review. Computers and Operations Research.
Preconditioned Temporal Difference Learning. Proceedings of the 25th International Conference on Machine Learning.
New Error Bounds for Approximations from Projected Linear Equations. Recent Advances in Reinforcement Learning.
Projected Equation Methods for Approximate Solution of Large Linear Systems. Journal of Computational and Applied Mathematics.
Optimal Online Learning Procedures for Model-Free Policy Evaluation. ECML PKDD '09: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II.
Error Bounds for Approximations from Projected Linear Equations. Mathematics of Operations Research.
Recursive Least-Squares Learning with Eligibility Traces. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Unified Inter and Intra Options Learning Using Policy Gradient Methods. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Batch, Off-Policy and Model-Free Apprenticeship Learning. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris. The Journal of Machine Learning Research.
Reinforcement Learning Algorithms with Function Approximation: Recent Advances and Applications. Information Sciences: An International Journal.
Policy Oscillation Is Overshooting. Neural Networks.
We consider policy evaluation algorithms in the context of infinite-horizon, discounted-cost dynamic programming problems. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods that use simulation, temporal differences, and linear cost function approximation. The first is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe. The second is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, the only available convergence result is that of Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing that LSTD(λ) converges with probability 1 for every λ ∈ [0, 1].
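To make the LSTD(λ) computation concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: along a simulated trajectory it accumulates a matrix A and vector b using an eligibility trace, then solves A θ = b for the linear value-function weights. The function name `lstd`, its interface, and the toy three-state chain are all illustrative assumptions introduced here.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.9, lam=0.0):
    """Illustrative LSTD(lambda) sketch (interface is hypothetical).

    Accumulates A = sum_t z_t (phi(s_t) - gamma*phi(s_{t+1}))^T and
    b = sum_t z_t r_t along a trajectory, with eligibility trace
    z_t = gamma*lam*z_{t-1} + phi(s_t), then solves A theta = b.
    """
    d = phi(transitions[0][0]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)
    for s, r, s_next in transitions:
        z = gamma * lam * z + phi(s)                    # eligibility trace update
        A += np.outer(z, phi(s) - gamma * phi(s_next))  # accumulate statistics
        b += z * r
    return np.linalg.solve(A, b)                        # linear approximation weights

# Toy example: a deterministic 3-state cycle with tabular (one-hot) features,
# so the linear architecture can represent the discounted cost exactly.
phi = lambda s: np.eye(3)[s]
trajectory = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)]  # (state, cost, next state)
theta = lstd(trajectory, phi, gamma=0.9, lam=0.0)
```

With tabular features and λ = 0, one pass through the cycle reproduces the Bellman equations exactly, so θ matches the true discounted cost of each state; with general features, LSTD(λ) instead converges to the fixed point of the projected equation as the trajectory grows.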