We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
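To make the contrast concrete, here is a minimal sketch (not the paper's own code) of the two update rules the abstract compares: discounted TD(0) and average reward TD(0), both with a linearly parameterized value function. The `env_step` and `phi` helpers, the toy two-state chain, and all step sizes are illustrative assumptions; the constant basis function appears in `phi` because the abstract's scaling argument concerns it.

```python
import numpy as np

def discounted_td0(env_step, phi, num_features, gamma=0.99,
                   alpha=0.01, num_steps=100_000, seed=0):
    """Discounted TD(0) with a linear value function V(s) ~= phi(s) @ theta."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)
    s = 0  # assume integer-labeled states starting at 0
    for _ in range(num_steps):
        s_next, r = env_step(s, rng)
        # Temporal-difference error under the discounted criterion.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
        s = s_next
    return theta

def average_reward_td0(env_step, phi, num_features,
                       alpha=0.01, beta=0.01, num_steps=100_000, seed=0):
    """Average reward TD(0): learns a differential value function
    h(s) ~= phi(s) @ theta alongside a running average-reward estimate mu."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)
    mu = 0.0
    s = 0
    for _ in range(num_steps):
        s_next, r = env_step(s, rng)
        # TD error with the average reward subtracted instead of discounting.
        delta = r - mu + phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)
        mu += beta * (r - mu)  # track the long-run average reward
        s = s_next
    return theta, mu

if __name__ == "__main__":
    # Toy 2-state chain, purely illustrative.
    def env_step(s, rng):
        s_next = int(rng.integers(2))      # jump uniformly between states
        r = 1.0 if s_next == 1 else 0.0    # reward for landing in state 1
        return s_next, r

    def phi(s):
        # One-hot features plus a constant basis function; the abstract's
        # argument is that the transient behaviors of the two algorithms
        # align when this constant feature is scaled appropriately.
        return np.array([float(s == 0), float(s == 1), 1.0])

    theta_disc = discounted_td0(env_step, phi, 3, gamma=0.99)
    theta_avg, mu = average_reward_td0(env_step, phi, 3)
    print(theta_disc, theta_avg, mu)
```

As the abstract's asymptotic result suggests, pushing `gamma` toward 1 in the first routine drives its value estimates toward the differential values produced by the second (up to an additive constant absorbed by the constant feature).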