We consider a discrete-time, finite-state Markov reward process that depends on a set of parameters. We start with a brief review of (stochastic) gradient descent methods that tune the parameters in order to optimize the average reward, using a single (possibly simulated) sample path of the process of interest. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1. On the other hand, the updates can have high variance, resulting in slow convergence. We address this issue and propose two approaches to reduce the variance. These approaches rely on approximate gradient formulas, which introduce an additional bias into the update direction. We derive bounds on the bias terms and characterize the asymptotic behavior of the resulting algorithms. For one of the approaches considered, the magnitude of the bias term exhibits an interesting dependence on the time it takes for the rewards to reach steady state. We also apply the methodology to Markov reward processes with a reward-free termination state and an expected total reward criterion. We use a call admission control problem to illustrate the performance of the proposed algorithms.
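The following is a minimal sketch, not the paper's algorithm, of the kind of online update the abstract describes: a likelihood-ratio (score-function) gradient estimate for the average reward of a parameterized Markov chain, with a discounted eligibility trace that trades variance for bias. The two-state chain, the sigmoid parameterization, the reward values, and all step sizes are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    # clip for numerical safety
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def step_chain(state, theta, rng):
    """Sample the next state of a toy two-state chain and return it together
    with the score, d/d(theta) log p_theta(next_state | state)."""
    p1 = sigmoid(theta[state])            # probability of moving to state 1
    next_state = 1 if rng.random() < p1 else 0
    score = np.zeros_like(theta)
    score[state] = (1.0 - p1) if next_state == 1 else -p1
    return next_state, score

def online_policy_gradient(num_steps=200_000, alpha=0.9,
                           step_theta=1e-3, step_lambda=1e-2, seed=0):
    """Online stochastic gradient ascent on the average reward along a single
    sample path.  The discounted trace (alpha < 1) reduces the variance of the
    updates at the cost of a bias in the gradient estimate."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                   # one parameter per state
    lam = 0.0                             # running estimate of the average reward
    z = np.zeros(2)                       # eligibility trace of score functions
    rewards = np.array([0.0, 1.0])        # reward of states 0 and 1
    state = 0
    for _ in range(num_steps):
        next_state, score = step_chain(state, theta, rng)
        r = rewards[next_state]
        z = alpha * z + score             # discounted likelihood-ratio trace
        theta += step_theta * (r - lam) * z   # gradient-ascent step on theta
        lam += step_lambda * (r - lam)        # track the average reward
        state = next_state
    return theta, lam

if __name__ == "__main__":
    theta, lam = online_policy_gradient()
    print("theta:", theta, "estimated average reward:", lam)
```

In this toy setup the average reward increases with the probability of occupying state 1, so the updates push both components of theta upward; choosing alpha closer to 1 reduces the bias of the gradient estimate but increases the variance of the individual updates, which is the trade-off analyzed in the abstract.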