The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances correspondence between model and data in several experiments, including those in which rewards are omitted or received early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
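The idea described above can be sketched in code: a stimulus leaves a decaying memory trace, a bank of Gaussian basis functions over the trace's height yields microstimuli that grow weaker and broader with elapsed time, and linear TD(λ) learns reward predictions from those features. This is a minimal illustration, not the authors' implementation: the parameter values (trace decay, number of microstimuli, width `sigma`, learning rate), the use of stimulus-spawned microstimuli only (the paper also has rewards spawn microstimuli), and the function names are all assumptions made for the example.

```python
import numpy as np

def microstimuli(trace_height, n=10, sigma=0.08):
    """Microstimulus features: Gaussian basis functions over the height of a
    decaying memory trace, scaled by the trace so later features are weaker
    and more diffuse (illustrative parameterization, not the paper's exact one)."""
    centers = np.linspace(1.0 / n, 1.0, n)
    return trace_height * np.exp(-((trace_height - centers) ** 2) / (2 * sigma ** 2))

def run_trials(n_trials=200, trial_len=20, reward_time=10,
               alpha=0.1, gamma=0.98, lam=0.9, decay=0.9, n=10):
    """Linear TD(lambda) over microstimulus features; a single stimulus at t=0
    is followed by a unit reward at `reward_time`."""
    w = np.zeros(n)                       # weights for the value function V = w.x
    deltas_at_reward = []                 # TD error at the moment of reward, per trial
    for _ in range(n_trials):
        e = np.zeros(n)                   # eligibility traces
        x = microstimuli(1.0, n)          # stimulus onset: trace height = 1
        for t in range(trial_len):
            trace = decay ** (t + 1)      # memory trace decays each time step
            x_next = microstimuli(trace, n) if t + 1 < trial_len else np.zeros(n)
            r = 1.0 if t + 1 == reward_time else 0.0
            v, v_next = w @ x, w @ x_next
            delta = r + gamma * v_next - v          # TD reward-prediction error
            e = gamma * lam * e + x
            w += alpha * delta * e
            if t + 1 == reward_time:
                deltas_at_reward.append(delta)
            x = x_next
    return w, deltas_at_reward
```

Over training, the TD error at reward delivery shrinks as the microstimulus features come to predict the reward, and the learned prediction at stimulus onset becomes positive; because neighboring time points share overlapping microstimuli, the representation generalizes across time rather than treating each moment as a distinct state (the complete serial compound).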