In this paper, we propose an extension to the tabular TD(λ) algorithm. In TD learning, rewards are propagated along the sequence of state/action pairs that have been visited recently. In addition, we propose to propagate rewards to state/action pairs that neighbor this sequence, even though they were not visited. This greatly reduces the number of iterations TD(λ) needs in order to generalize, since a state/action pair no longer has to be visited for its Q-value to be updated. This propagation process brings tabular TD(λ) closer to neural-network-based TD(λ) in its ability to generalize, while leaving the other properties of tabular TD(λ) unchanged.
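The idea above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it assumes a small grid world where "neighbors" of a state are the adjacent grid cells, uses Watkins-style Q(λ) with replacing-style traces, and introduces an assumed `neighbor_alpha` step size that applies a fraction of each TD correction to the unvisited neighboring state/action pairs.

```python
import random
from collections import defaultdict

def neighbors(state, grid_size):
    """Hypothetical neighborhood: grid cells adjacent to `state`."""
    x, y = state
    return [(x + dx, y + dy)
            for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]
            if 0 <= x + dx < grid_size and 0 <= y + dy < grid_size]

def q_lambda_with_neighbor_propagation(
        episodes, grid_size=5, goal=(4, 4),
        alpha=0.1, gamma=0.9, lam=0.8, neighbor_alpha=0.05):
    """Tabular Q(lambda) sketch: each TD correction is also applied,
    scaled by neighbor_alpha, to neighboring (possibly unvisited)
    state/action pairs along the eligibility trace."""
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    Q = defaultdict(float)              # Q[(state, action)]
    for _ in range(episodes):
        trace = defaultdict(float)      # eligibility traces
        state = (0, 0)
        while state != goal:
            # epsilon-greedy action selection
            if random.random() < 0.1:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            # grid transition, clamped to the board
            nx = min(max(state[0] + action[0], 0), grid_size - 1)
            ny = min(max(state[1] + action[1], 0), grid_size - 1)
            next_state = (nx, ny)
            reward = 1.0 if next_state == goal else 0.0
            best_next = max(Q[(next_state, a)] for a in actions)
            delta = reward + gamma * best_next - Q[(state, action)]
            trace[(state, action)] = 1.0
            for (s, a), e in list(trace.items()):
                # standard eligibility-trace update for visited pairs
                Q[(s, a)] += alpha * delta * e
                # proposed extension: propagate a fraction of the same
                # correction to neighboring states, visited or not
                for n in neighbors(s, grid_size):
                    Q[(n, a)] += neighbor_alpha * delta * e
                trace[(s, a)] = gamma * lam * e
            state = next_state
    return Q
```

Because the neighbor update writes to Q-values of pairs the agent never executed, value information spreads through the table faster than visitation alone would allow, which is the generalization effect the abstract describes.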