Dynamic Programming and Optimal Control, Two Volume Set.
Introduction to Reinforcement Learning.
On Average Versus Discounted Reward Temporal-Difference Learning. Machine Learning.
Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. COLT '00: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Average cost temporal-difference learning. Automatica (Journal of IFAC).
General discounting versus average reward. ALT '06: Proceedings of the 17th International Conference on Algorithmic Learning Theory.
In many reinforcement learning problems, it is appropriate to optimize the average reward. In practice, this is often done by solving the Bellman equations using a discount factor close to 1. In this paper, we provide a bound on the average reward of the policy obtained by solving the Bellman equations, a bound which depends on the relationship between the discount factor and the mixing time of the Markov chain. We extend this result to the direct policy gradient of Baxter and Bartlett, in which a discount parameter is used to find a biased estimate of the gradient of the average reward with respect to the parameters of a policy. We show that this biased gradient is an exact gradient of a related discounted problem and provide a bound on the optima found by following these biased gradients of the average reward. Further, we show that the exact Hessian in this related discounted problem is an approximate Hessian of the average reward, with equality in the limit as the discount factor tends to 1. We then provide an algorithm to estimate the Hessian from a sample path of the underlying Markov chain; this estimate converges with probability 1.
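To make the biased-gradient construction concrete, the sketch below estimates a Baxter-Bartlett style discounted ("beta-biased") gradient of the average reward from a single sample path. The two-state chain, the logistic two-action policy, and all names here are illustrative assumptions, not the paper's setup; only the eligibility-trace recursion z_{t+1} = beta * z_t + grad log pi(a_t | s_t) and the running average of r_{t+1} * z_{t+1} reflect the estimator discussed in the abstract.

import numpy as np

def gpomdp_gradient(theta, beta=0.99, T=100_000, seed=0):
    """Sketch: single-sample-path estimate of the beta-biased gradient of the
    average reward. The 2-state chain and logistic policy are assumptions made
    for illustration, not the paper's experimental setup."""
    rng = np.random.default_rng(seed)

    def policy_prob(s):
        # P(action = 1 | state = s) under a logistic policy, one weight per state
        return 1.0 / (1.0 + np.exp(-theta[s]))

    def step(s, a):
        # Toy chain: action 1 tends to move toward state 1, which pays reward 1
        p_to_1 = 0.9 if a == 1 else 0.1
        s_next = 1 if rng.random() < p_to_1 else 0
        return s_next, float(s_next == 1)

    z = np.zeros_like(theta)      # discounted eligibility trace
    grad = np.zeros_like(theta)   # running average of r_{t+1} * z_{t+1}
    s = 0
    for t in range(T):
        p1 = policy_prob(s)
        a = int(rng.random() < p1)
        s_next, r = step(s, a)
        # grad log pi(a|s) for the logistic policy: (a - p1) in component s
        glp = np.zeros_like(theta)
        glp[s] = a - p1
        # Eligibility trace: z_{t+1} = beta * z_t + grad log pi(a_t | s_t)
        z = beta * z + glp
        # Running average converges with probability 1 to the biased gradient
        grad += (r * z - grad) / (t + 1)
        s = s_next
    return grad

print(gpomdp_gradient(np.zeros(2)))  # positive entries: raising theta helps

Taking beta closer to 1 shrinks the bias of this estimate relative to the true average-reward gradient at the cost of higher variance; the paper's bounds quantify this trade-off in terms of the mixing time of the underlying Markov chain.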