We provide analytical expressions governing changes to the bias and variance of the lookup table estimators provided by various Monte Carlo and temporal difference value estimation algorithms with offline updates over trials in absorbing Markov reward processes. We have used these expressions to develop software that serves as an analysis tool: given a complete description of a Markov reward process, it rapidly yields an exact mean-square-error curve, the curve one would get from averaging together sample mean-square-error curves from an infinite number of learning trials on the given problem. We use our analysis tool to illustrate classes of mean-square-error curve behavior in a variety of example reward processes, and we show that although the various temporal difference algorithms are quite sensitive to the choice of step-size and eligibility-trace parameters, there are values of these parameters that make them similarly competent, and generally good.
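The analysis tool described above yields the exact mean-square-error curve analytically. As a rough empirical counterpart, one can approximate such a curve by averaging squared-error trajectories over many independent runs of TD(0) with offline updates on a small absorbing Markov reward process. The sketch below uses the classic 5-state random walk as an illustrative example; the process, step size, and trial counts are assumptions for demonstration, not the paper's own setup.

```python
import numpy as np

# Illustrative 5-state absorbing random walk: non-terminal states 0..4,
# absorption to the left (reward 0) or right (reward 1) with equal
# probability. All parameters here are assumptions for the sketch.
N = 5
P_RIGHT = 0.5
TRUE_V = np.arange(1, N + 1) / (N + 1)  # exact values for this chain

def run_trial(v, alpha):
    """One trial of TD(0) with offline updates: increments are
    accumulated during the trial and applied only at its end."""
    delta = np.zeros(N)
    s = N // 2  # start in the middle state
    while True:
        s2 = s + 1 if np.random.rand() < P_RIGHT else s - 1
        absorbed = s2 < 0 or s2 >= N
        r = 1.0 if s2 == N else 0.0
        v_next = 0.0 if absorbed else v[s2]
        delta[s] += alpha * (r + v_next - v[s])
        if absorbed:
            return v + delta
        s = s2

def mse_curve(alpha=0.1, n_trials=50, n_runs=500, seed=0):
    """Average squared error of the lookup-table estimates after each
    trial -- an empirical stand-in for the exact analytical curve."""
    np.random.seed(seed)
    curves = np.zeros((n_runs, n_trials))
    for run in range(n_runs):
        v = np.full(N, 0.5)  # initial lookup-table estimates
        for t in range(n_trials):
            v = run_trial(v, alpha)
            curves[run, t] = np.mean((v - TRUE_V) ** 2)
    return curves.mean(axis=0)

curve = mse_curve()
```

Averaging over more runs drives this empirical curve toward the exact one the analytical expressions deliver directly; the analytical route avoids the sampling noise entirely, which is what makes the tool fast and exact.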