General discounting versus average reward

  • Authors: Marcus Hutter
  • Affiliations: IDSIA / RSISE / ANU / NICTA
  • Venue: ALT'06, Proceedings of the 17th International Conference on Algorithmic Learning Theory
  • Year: 2006


Abstract

Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that the limits of U for m→∞ and of V for k→∞ are equal, provided both exist. Further, if the effective horizon grows linearly with k or faster, then existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.
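
For orientation, a minimal sketch of the quantities involved, using assumed notation (reward sequence r_1, r_2, …, summable discount sequence γ_k > 0); the paper's exact definitions and normalizations may differ:

\[
U_{1m} \;=\; \frac{1}{m}\sum_{i=1}^{m} r_i, \qquad
V_{k\gamma} \;=\; \frac{1}{\Gamma_k}\sum_{i=k}^{\infty} \gamma_i r_i, \qquad
\Gamma_k \;=\; \sum_{i=k}^{\infty} \gamma_i .
\]

Under these assumed definitions, one natural notion of effective horizon at cycle k is the smallest h with Γ_{k+h} ≤ Γ_k/2: geometric discounting γ_i = γ^i then yields a bounded horizon, whereas a discount such as γ_i = 1/i² yields a horizon growing roughly linearly in k, which is the regime the equivalence results above distinguish.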