We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
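To make the contrast concrete, here is a minimal sketch (not the paper's own code) of the two update rules the abstract compares: discounted TD(0) and average reward TD(0), both with a linearly parameterized value function. The `env_step` and `phi` helpers, the toy two-state chain, and all step sizes are illustrative assumptions; the constant basis function appears in `phi` because the abstract's scaling argument concerns it.

```python
import numpy as np

def discounted_td0(env_step, phi, num_features, gamma=0.99,
                   alpha=0.01, num_steps=100_000, seed=0):
    """Discounted TD(0) with a linear value function V(s) ~= phi(s) @ theta."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)
    s = 0  # assume integer-labeled states starting at 0
    for _ in range(num_steps):
        s_next, r = env_step(s, rng)
        # Temporal-difference error under the discounted criterion.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
        s = s_next
    return theta

def average_reward_td0(env_step, phi, num_features,
                       alpha=0.01, beta=0.01, num_steps=100_000, seed=0):
    """Average reward TD(0): learns a differential value function
    h(s) ~= phi(s) @ theta alongside a running average-reward estimate mu."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)
    mu = 0.0
    s = 0
    for _ in range(num_steps):
        s_next, r = env_step(s, rng)
        # TD error with the average reward subtracted instead of discounting.
        delta = r - mu + phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
        mu += beta * (r - mu)  # track the long-run average reward
        s = s_next
    return theta, mu

if __name__ == "__main__":
    # Toy 2-state chain, purely illustrative.
    def env_step(s, rng):
        s_next = int(rng.integers(2))      # jump uniformly between states
        r = 1.0 if s_next == 1 else 0.0    # reward for landing in state 1
        return s_next, r

    def phi(s):
        # One-hot features plus a constant basis function; the abstract's
        # argument is that the transient behaviors of the two algorithms
        # align when this constant feature is scaled appropriately.
        return np.array([float(s == 0), float(s == 1), 1.0])

    theta_disc = discounted_td0(env_step, phi, 3, gamma=0.99)
    theta_avg, mu = average_reward_td0(env_step, phi, 3)
    print(theta_disc, theta_avg, mu)
```

As the abstract's asymptotic result suggests, pushing `gamma` toward 1 in the first routine drives its value estimates toward the differential values produced by the second (up to an additive constant absorbed by the constant feature).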