TD(λ) Converges with Probability 1. Machine Learning.
A Counterexample to Temporal Differences Learning. Neural Computation.
Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning (special issue on reinforcement learning).
Matrix Computations (3rd ed.).
Dynamic Programming and Optimal Control.
Dynamic Programming and Optimal Control.
Neuro-Dynamic Programming.
Gradient Convergence in Gradient Methods with Errors. SIAM Journal on Optimization.
Technical Update: Least-Squares Temporal Difference Learning. Machine Learning.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Least-Squares Policy Iteration. The Journal of Machine Learning Research.
Performance Loss Bounds for Approximate Value Iteration with State Aggregation. Mathematics of Operations Research.
Dynamic Modeling and Control of Supply Chain Systems: A Review. Computers and Operations Research.
Preconditioned Temporal Difference Learning. Proceedings of the 25th International Conference on Machine Learning.
New Error Bounds for Approximations from Projected Linear Equations. Recent Advances in Reinforcement Learning.
Projected Equation Methods for Approximate Solution of Large Linear Systems. Journal of Computational and Applied Mathematics.
Optimal Online Learning Procedures for Model-Free Policy Evaluation. ECML PKDD '09: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II.
Error Bounds for Approximations from Projected Linear Equations. Mathematics of Operations Research.
Recursive Least-Squares Learning with Eligibility Traces. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Unified Inter and Intra Options Learning Using Policy Gradient Methods. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Batch, Off-Policy and Model-Free Apprenticeship Learning. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris. The Journal of Machine Learning Research.
Reinforcement Learning Algorithms with Function Approximation: Recent Advances and Applications. Information Sciences: An International Journal.
Policy Oscillation Is Overshooting. Neural Networks.
We consider policy evaluation algorithms in the context of infinite-horizon, discounted-cost dynamic programming problems. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods that use simulation, temporal differences, and linear cost function approximation. The first is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe. The second is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, the only available convergence result is that of Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing that LSTD(λ) converges with probability 1 for every λ ∈ [0, 1].
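To make the LSTD(λ) computation concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: along a simulated trajectory it accumulates a matrix A and vector b using an eligibility trace, then solves A θ = b for the linear value-function weights. The function name `lstd`, its interface, and the toy three-state chain are all illustrative assumptions introduced here.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.9, lam=0.0):
    """Illustrative LSTD(lambda) sketch (interface is hypothetical).

    Accumulates A = sum_t z_t (phi(s_t) - gamma*phi(s_{t+1}))^T and
    b = sum_t z_t r_t along a trajectory, with eligibility trace
    z_t = gamma*lam*z_{t-1} + phi(s_t), then solves A theta = b.
    """
    d = phi(transitions[0][0]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)
    for s, r, s_next in transitions:
        z = gamma * lam * z + phi(s)                    # eligibility trace update
        A += np.outer(z, phi(s) - gamma * phi(s_next))  # accumulate statistics
        b += z * r
    return np.linalg.solve(A, b)                        # linear approximation weights

# Toy example: a deterministic 3-state cycle with tabular (one-hot) features,
# so the linear architecture can represent the discounted cost exactly.
phi = lambda s: np.eye(3)[s]
trajectory = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)]  # (state, cost, next state)
theta = lstd(trajectory, phi, gamma=0.9, lam=0.0)
```

With tabular features and λ = 0, one pass through the cycle reproduces the Bellman equations exactly, so θ matches the true discounted cost of each state; with general features, LSTD(λ) instead converges to the fixed point of the projected equation as the trajectory grows.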