TD(λ) Converges with Probability 1

  • Authors:
  • Peter Dayan; Terrence J. Sejnowski

  • Affiliations:
  • CNL, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800. dayan@helmholtz.sdsc.edu
  • CNL, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800. tsejnowski@uscd.edu

  • Venue:
  • Machine Learning
  • Year:
  • 1994

Abstract

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
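The abstract describes the prediction setting only in words; the sketch below is a rough illustration of a standard tabular TD(λ) predictor with accumulating eligibility traces and a constant step size. It is a generic illustration, not the slightly modified algorithm the paper analyzes: the probability-one result depends on appropriately decreasing learning rates, and the `random_walk` test chain and all parameter values here are hypothetical.

```python
import numpy as np

def td_lambda(step, n_states, start_state, lam=0.9, alpha=0.05,
              gamma=1.0, n_episodes=2000):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    `step(state)` samples (next_state, reward, done) from the process
    whose expected outcome we want to predict.
    """
    V = np.zeros(n_states)                  # current predictions
    for _ in range(n_episodes):
        e = np.zeros(n_states)              # eligibility traces, reset per episode
        s, done = start_state, False
        while not done:
            s_next, r, done = step(s)
            # TD error: sampled one-step return minus current prediction
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                     # accumulate trace for the visited state
            V += alpha * delta * e          # credit all recently visited states
            e *= gamma * lam                # decay traces toward zero
            s = s_next
    return V

# Hypothetical test process: a Sutton-style random walk on states 0..4,
# started in the middle, with reward 1 for exiting on the right.
def random_walk(s, n=5):
    s_next = s + (1 if np.random.rand() < 0.5 else -1)
    if s_next < 0:
        return 0, 0.0, True                 # absorbed on the left, reward 0
    if s_next >= n:
        return 0, 1.0, True                 # absorbed on the right, reward 1
    return s_next, 0.0, False

print(td_lambda(random_walk, n_states=5, start_state=2))
# True values for this chain are [1/6, 2/6, 3/6, 4/6, 5/6].
```

With a constant α the predictions fluctuate around the correct values rather than converging; replacing it with a decreasing schedule (for example α_t with Σ α_t = ∞ and Σ α_t² < ∞) gives the kind of stochastic-approximation behavior for which convergence with probability one can be established.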