The actor-critic learning is behind the matching law: Matching versus optimal behaviors

Authors:
Yutaka Sakai;Tomoki Fukai
Affiliations:
Department of Intelligent Information Systems, Tamagawa University, Machida, Tokyo 194-8610, Japan (sakai@eng.tamagawa.ac.jp); Laboratory for Neural Circuit Theory, Brain Science Institute, RIKEN, Wako, Saitama 351-0198, Japan (tfukai@brain.riken.jp)
Venue:
Neural Computation
Year:
2008

Abstract

The ability to choose correctly among behavioral options is crucial for animals' survival. The neural basis of behavioral choice has attracted growing attention in research on biological and artificial neural systems. Alternative choice tasks with variable ratio (VR) and variable interval (VI) schedules of reinforcement are often employed to study decision making in animals and humans. In the VR schedule task, alternative choices are reinforced with different probabilities, and subjects learn to select the response that is rewarded more frequently. In the VI schedule task, alternative choices are reinforced at different average intervals, independent of the choice frequencies, and choice behavior follows the so-called matching law: the fraction of choices allocated to an option matches the fraction of rewards obtained from it. Both policies appear robustly in subjects' choice behavior, but the underlying neural mechanisms remain unknown. Here, we show that these seemingly different policies can emerge from a common computational algorithm known as actor-critic learning. We present experimentally testable variations of the VI schedule in which matching behavior is only a suboptimal solution, and we show that the actor-critic system exhibits matching behavior in the steady state of learning even when matching is suboptimal. Nevertheless, matching behavior earns approximately the same reward as the optimal behavior in many practical situations.
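
To make the two schedules and the learning rule concrete, the following is a minimal sketch and not the authors' model. It assumes a discrete-trial, two-choice task, a stateless critic that tracks the average reward, and a softmax actor updated by the critic's prediction error. The function name `simulate`, the learning rates, and the per-trial arming probabilities are all hypothetical choices made for illustration.

```python
import math
import random


def simulate(schedule, probs, trials=20000, alpha_actor=0.05,
             alpha_critic=0.05, seed=0):
    """Run a stateless actor-critic on a two-choice task (illustrative only).

    schedule: "VR" (each choice rewarded with probability probs[i]) or
              "VI" (each option armed with probability probs[i] per trial,
              independent of the choice; an armed reward waits until taken).
    Returns (choices, rewards): per-option counts.
    """
    rng = random.Random(seed)
    prefs = [0.0, 0.0]        # actor: action preferences
    value = 0.0               # critic: running estimate of average reward
    baited = [False, False]   # VI only: reward currently waiting on option i
    choices = [0, 0]
    rewards = [0, 0]

    for _ in range(trials):
        # Softmax policy over preferences (shifted by the max for stability).
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        total = exps[0] + exps[1]
        policy = [e / total for e in exps]
        action = 0 if rng.random() < policy[0] else 1

        if schedule == "VI":
            for i in range(2):
                if not baited[i] and rng.random() < probs[i]:
                    baited[i] = True
            reward = 1.0 if baited[action] else 0.0
            baited[action] = False
        elif schedule == "VR":
            reward = 1.0 if rng.random() < probs[action] else 0.0
        else:
            raise ValueError("schedule must be 'VR' or 'VI'")

        # Critic: prediction error against the average-reward estimate.
        delta = reward - value
        value += alpha_critic * delta
        # Actor: raise the chosen option's preference when delta > 0.
        for i in range(2):
            indicator = 1.0 if i == action else 0.0
            prefs[i] += alpha_actor * delta * (indicator - policy[i])

        choices[action] += 1
        rewards[action] += int(reward)

    return choices, rewards


if __name__ == "__main__":
    for schedule, probs in (("VR", [0.4, 0.2]), ("VI", [0.2, 0.1])):
        choices, rewards = simulate(schedule, probs)
        choice_frac = choices[0] / sum(choices)
        reward_frac = rewards[0] / max(1, sum(rewards))
        print(schedule, "choice fraction:", round(choice_frac, 3),
              "reward fraction:", round(reward_frac, 3))
```

Under the matching law, the two printed fractions for the VI run should be close to each other. Whether this simplified sketch reproduces that depends on the assumed parameters, so the output is best read as an exploratory check and not as a replication of the paper's results.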