Brief paper: Average cost temporal-difference learning

  • Authors:
  • John N. Tsitsiklis; Benjamin Van Roy

  • Affiliations:
  • Laboratory for Information and Decision Systems, Room 35-209, 77 Massachusetts Avenue, Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA (both authors)

  • Venue:
  • Automatica (Journal of IFAC)
  • Year:
  • 1999

Abstract

We propose a variant of temporal-difference learning that approximates the average and differential costs of an irreducible aperiodic Markov chain. The approximations are linear combinations of fixed basis functions, whose weights are updated incrementally along a single infinitely long trajectory of the Markov chain. We prove convergence (with probability 1) and characterize the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel the authors' previous work on approximating the discounted cost-to-go.
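The abstract describes the algorithm only in words. As a rough illustration, here is a minimal Python sketch of an average-cost TD(λ) update of the general kind described: a running estimate `mu` of the average cost, a temporal difference built from `mu` and a linear differential-cost approximation φ(i)'r, and an eligibility trace `z`, all updated along one simulated trajectory (a finite prefix of the "endless" one). The helper names, the step-size schedule γ_t = (10 + t)^(-0.7), the choice η_t = c·γ_t for the average-cost step, λ = 0.5, and the three-state example chain are assumptions made for illustration, not details taken from this page. The example features are chosen so that the constant vector is not in their span.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def average_cost_td(P, g, phi, lam=0.5, c=1.0, steps=100_000, seed=0):
    """Sketch of average-cost TD(lambda) with linear basis functions.

    P   -- transition matrix as a list of rows (irreducible, aperiodic chain)
    g   -- one-stage cost g(i) for each state i
    phi -- feature vector phi(i) for each state i (all of length K)
    Returns (mu, r): the average-cost estimate and the weight vector r,
    so that phi(i)'r approximates the differential cost of state i
    up to an additive constant.
    """
    rng = random.Random(seed)
    states = range(len(P))
    r = [0.0] * len(phi[0])   # basis-function weights
    mu = 0.0                  # running average-cost estimate
    i = 0                     # arbitrary initial state i_0
    z = list(phi[i])          # eligibility trace, z_0 = phi(i_0)
    for t in range(steps):
        gamma = 1.0 / (10 + t) ** 0.7  # assumed diminishing step size
        eta = c * gamma                # assumed average-cost step size
        j = rng.choices(states, weights=P[i])[0]  # sample i_{t+1} from row i of P
        # Temporal difference: g(i_t) - mu_t + phi(i_{t+1})'r_t - phi(i_t)'r_t
        d = g[i] - mu + dot(phi[j], r) - dot(phi[i], r)
        r = [rk + gamma * d * zk for rk, zk in zip(r, z)]
        mu += eta * (g[i] - mu)
        # Trace update: z_{t+1} = lam * z_t + phi(i_{t+1})
        z = [lam * zk + fk for zk, fk in zip(z, phi[j])]
        i = j
    return mu, r


# Hypothetical example: a three-state chain with costs 1, 2, 4 and two
# indicator features (state 2 is represented by the zero vector).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
g = [1.0, 2.0, 4.0]
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mu, r = average_cost_td(P, g, phi, steps=200_000)
print(mu, r)
```

Because the example P is doubly stochastic, its stationary distribution is uniform and the true average cost is (1 + 2 + 4)/3 = 7/3, so `mu` should settle near 2.33. These two features can represent any differential cost vector shifted so that state 2 has value 0, so `r` should approach roughly (-4.04, -2.28), the solution of (I - P)h = g - (7/3)e with h(2) = 0.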