On the convergence of stochastic iterative dynamic programming algorithms

  • Authors:
  • Tommi Jaakkola;Michael I. Jordan;Satinder P. Singh

  • Affiliations:
  • Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA;Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA;Department of Computer Science, University of Massachusetts, Amherst, MA 01003 USA

  • Venue:
  • Neural Computation
  • Year:
  • 1994

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.