Dynamic Programming and Optimal Control, Two Volume Set.
Introduction to Reinforcement Learning.
On Average Versus Discounted Reward Temporal-Difference Learning. Machine Learning.
Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. COLT '00: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Average cost temporal-difference learning. Automatica (Journal of IFAC).
General discounting versus average reward. ALT '06: Proceedings of the 17th International Conference on Algorithmic Learning Theory.
In many reinforcement learning problems, it is appropriate to optimize the average reward. In practice, this is often done by solving the Bellman equations using a discount factor close to 1. In this paper, we provide a bound on the average reward of the policy obtained by solving the Bellman equations, a bound which depends on the relationship between the discount factor and the mixing time of the Markov chain. We extend this result to the direct policy gradient of Baxter and Bartlett, in which a discount parameter is used to find a biased estimate of the gradient of the average reward with respect to the parameters of a policy. We show that this biased gradient is an exact gradient of a related discounted problem and provide a bound on the optima found by following these biased gradients of the average reward. Further, we show that the exact Hessian in this related discounted problem is an approximate Hessian of the average reward, with equality in the limit as the discount factor tends to 1. We then provide an algorithm to estimate the Hessian from a sample path of the underlying Markov chain; this estimate converges with probability 1.
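To make the biased-gradient construction concrete, the sketch below estimates a Baxter-Bartlett style discounted ("beta-biased") gradient of the average reward from a single sample path. The two-state chain, the logistic two-action policy, and all names here are illustrative assumptions, not the paper's setup; only the eligibility-trace recursion z_{t+1} = beta * z_t + grad log pi(a_t | s_t) and the running average of r_{t+1} * z_{t+1} reflect the estimator discussed in the abstract.

import numpy as np

def gpomdp_gradient(theta, beta=0.99, T=100_000, seed=0):
    """Sketch: single-sample-path estimate of the beta-biased gradient of the
    average reward. The 2-state chain and logistic policy are assumptions made
    for illustration, not the paper's experimental setup."""
    rng = np.random.default_rng(seed)

    def policy_prob(s):
        # P(action = 1 | state = s) under a logistic policy, one weight per state
        return 1.0 / (1.0 + np.exp(-theta[s]))

    def step(s, a):
        # Toy chain: action 1 tends to move toward state 1, which pays reward 1
        p_to_1 = 0.9 if a == 1 else 0.1
        s_next = 1 if rng.random() < p_to_1 else 0
        return s_next, float(s_next == 1)

    z = np.zeros_like(theta)      # discounted eligibility trace
    grad = np.zeros_like(theta)   # running average of r_{t+1} * z_{t+1}
    s = 0
    for t in range(T):
        p1 = policy_prob(s)
        a = int(rng.random() < p1)
        s_next, r = step(s, a)
        # grad log pi(a|s) for the logistic policy: (a - p1) in component s
        glp = np.zeros_like(theta)
        glp[s] = a - p1
        # Eligibility trace: z_{t+1} = beta * z_t + grad log pi(a_t | s_t)
        z = beta * z + glp
        # Running average converges with probability 1 to the biased gradient
        grad += (r * z - grad) / (t + 1)
        s = s_next
    return grad

print(gpomdp_gradient(np.zeros(2)))  # positive entries: raising theta helps

Taking beta closer to 1 shrinks the bias of this estimate relative to the true average-reward gradient at the cost of higher variance; the paper's bounds quantify this trade-off in terms of the mixing time of the underlying Markov chain.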