On-Line Policy Gradient Estimation with Multi-Step Sampling

Authors:
Yan-Jie Li; Fang Cao; Xi-Ren Cao
Affiliations:
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Kowloon, Hong Kong and Division of Control and Mechatronics Engineering, Harbin Institute of ...;Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Kowloon, Hong Kong and School of Electronics and Information Engineering, Beijing Jiaotong Un ...;Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
Venue:
Discrete Event Dynamic Systems
Year:
2010

Abstract

In this note, we discuss sample-path-based (on-line) performance gradient estimation for Markov systems. Existing on-line performance gradient estimation algorithms generally require a standard importance sampling assumption. When this assumption does not hold, these algorithms may yield poor gradient estimates. We show that the assumption can be relaxed and propose algorithms with multi-step sampling for performance gradient estimation; these algorithms do not require the standard assumption. Simulation examples are given to illustrate the accuracy of the estimates.
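
To make the role of the assumption concrete, the following is a minimal sketch of a generic one-step likelihood-ratio gradient estimator on a two-state Markov chain. It is not the authors' multi-step algorithm. It assumes that the "standard importance sampling assumption" refers to the usual requirement that a transition with zero probability also has zero derivative, so that the ratio of derivative to probability is defined on every transition that matters. The chain, the reward vector f, the discount factor beta, and all function names (chain, lr_gradient_estimate, exact_gradient) are hypothetical choices made for illustration only.

```python
import random


def chain(theta, b):
    """Hypothetical 2-state chain: P = [[1-theta, theta], [b, 1-b]].

    Q is the derivative of P with respect to theta.
    """
    P = [[1.0 - theta, theta], [b, 1.0 - b]]
    Q = [[-1.0, 1.0], [0.0, 0.0]]
    return P, Q


def exact_gradient(theta, b, f):
    """Closed-form derivative of the average reward for the chain above.

    The stationary distribution is (b, theta) / (theta + b), so the average
    reward is (b*f[0] + theta*f[1]) / (theta + b); this is its derivative.
    """
    return b * (f[1] - f[0]) / (theta + b) ** 2


def lr_gradient_estimate(P, Q, f, beta=0.9, steps=200_000, seed=0):
    """One-step likelihood-ratio estimate from a single sample path.

    Illustrative assumption: a discounted eligibility trace z accumulates the
    ratios Q[x][y] / P[x][y] along the path, and rewards are weighted by z.
    """
    rng = random.Random(seed)
    x, z, total = 0, 0.0, 0.0
    for _ in range(steps):
        # Sample the next state from row x of P.
        y = 1 if rng.random() < P[x][1] else 0
        # Only transitions with P[x][y] > 0 are ever observed, so the ratio
        # is always defined here; zero-probability transitions are simply
        # never seen, even if their derivative Q[x][y] is nonzero.
        z = beta * z + Q[x][y] / P[x][y]
        total += f[y] * z
        x = y
    return total / steps


if __name__ == "__main__":
    f = [0.0, 1.0]

    # Case 1: every transition with nonzero derivative has positive probability.
    P, Q = chain(0.5, 0.5)
    print("estimate:", lr_gradient_estimate(P, Q, f))
    print("exact:   ", exact_gradient(0.5, 0.5, f))

    # Case 2: P[0][1] = 0 but Q[0][1] = 1, so the assumed condition fails.
    P0, Q0 = chain(0.0, 0.5)
    print("estimate:", lr_gradient_estimate(P0, Q0, f, steps=1000))
    print("exact:   ", exact_gradient(0.0, 0.5, f))
```

In the second case the path started in state 0 never leaves it, so the estimator returns 0 while the closed-form (one-sided) derivative is 2.0. This is one way in which an estimator that relies on one-step ratios can produce a poor estimate when the assumed condition is violated.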