Q(λ) is a reinforcement learning algorithm that combines Q-learning and TD(λ). Online implementations of Q(λ) that use eligibility traces have been shown to speed up basic Q-learning. In this paper we present an asymptotic analysis of Watkins' Q(λ) with accumulating eligibility traces. We first introduce an asymptotic approximation of Q(λ) that appears to be a gain-matrix variant of basic Q-learning. Using the ODE method, we then determine an optimal gain matrix for Q-learning that maximizes its rate of convergence toward the optimal value function Q*. The similarity between this optimal gain and the asymptotic gain of Q(λ) explains the relative efficiency of the latter for λ > 0. Furthermore, by minimizing the difference between these two gains, optimal values for the parameter λ and for the decreasing learning rates can be determined. This optimal λ depends strongly on the exploration policy followed during learning. A robust approximation of these learning parameters leads to the definition of a new, efficient algorithm called AQ-learning (Average Q-learning), which closely resembles Schwartz's R-learning. We demonstrate our results through numerical simulations.
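For reference, here is a minimal sketch of tabular Watkins' Q(λ) with accumulating eligibility traces, the algorithm the abstract analyzes. The Gymnasium-style environment interface (reset/step), the ε-greedy exploration policy, and all hyperparameter values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def watkins_q_lambda(env, n_states, n_actions, n_episodes=500,
                     alpha=0.1, gamma=0.95, lam=0.8, epsilon=0.1):
    """Tabular Watkins' Q(lambda) with accumulating eligibility traces.

    Assumes a Gymnasium-style environment API:
    reset() -> (state, info), step(a) -> (state, reward,
    terminated, truncated, info). This interface and the
    hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # Behavior policy: explore with probability epsilon.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)                  # eligibility traces
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)               # next action (behavior policy)
            a_star = int(np.argmax(Q[s2]))    # greedy action (target policy)
            target = r if terminated else r + gamma * Q[s2, a_star]
            delta = target - Q[s, a]
            e[s, a] += 1.0                    # accumulate, don't replace
            Q += alpha * delta * e            # back up along the whole trace
            if Q[s2, a2] == Q[s2, a_star]:
                e *= gamma * lam              # next action greedy: decay traces
            else:
                e[:] = 0.0                    # exploratory action: cut traces
            s, a = s2, a2
    return Q
```

With λ = 0 the decayed trace vanishes after every step, so only the fresh entry e(s, a) contributes and the update collapses to basic one-step Q-learning, which is the baseline against which the paper's gain-matrix comparison is made.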