Stochastic policy search for variance-penalized semi-Markov control

  • Authors:
  • Abhijit Gosavi, Mandar Purohit

  • Affiliations:
  • Missouri University of Science and Technology, Rolla, MO; ESRI, Redlands, CA

  • Venue:
  • Proceedings of the Winter Simulation Conference
  • Year:
  • 2011


Abstract

The variance-penalized metric in Markov decision processes (MDPs) seeks to maximize the average reward minus a scalar multiple of the variance of rewards. In this paper, our goal is to study the same metric in the context of the semi-Markov decision process (SMDP). In an SMDP, unlike an MDP, the time spent in each transition is not identical and may in fact be a random variable. We first develop an expression for the variance of rewards in SMDPs, and then formulate the variance-penalized SMDP (VP-SMDP). Our interest here is in solving the problem without generating the underlying transition probabilities of the Markov chains. We propose the use of two stochastic search techniques, namely simultaneous perturbation and learning automata, to solve the problem; these techniques use stochastic policies and can be employed within simulators, thereby avoiding the generation of the transition probabilities.
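The simulation-based idea in the abstract can be sketched as follows: because the score of a policy is estimated from a simulated trajectory, no transition probabilities are ever constructed. This is a minimal illustration only, not the authors' algorithm; the simulator, policy, and the use of the sample variance of per-transition rewards as the penalty term are all hypothetical placeholders (the paper derives the exact variance expression for SMDPs).

```python
import random

def vp_score(sim_transition, policy, theta=0.1, n_steps=10000, seed=0):
    """Estimate the variance-penalized score of a stochastic policy by
    simulating one long trajectory: average reward per unit time minus
    theta times the (sample) variance of per-transition rewards.
    No transition-probability matrix is ever built."""
    rng = random.Random(seed)
    state = 0
    rewards, times = [], []
    for _ in range(n_steps):
        action = policy(state, rng)
        state, r, t = sim_transition(state, action, rng)  # next state, reward, sojourn time
        rewards.append(r)
        times.append(t)
    avg_reward = sum(rewards) / sum(times)  # SMDP: divide by elapsed time, not step count
    mean_r = sum(rewards) / len(rewards)
    var_r = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    return avg_reward - theta * var_r

# Hypothetical two-state SMDP simulator: sojourn times are random,
# so reward must be normalized by time rather than by transition count.
def toy_sim(state, action, rng):
    r = 1.0 if action == 0 else rng.uniform(0.0, 3.0)  # action 1 is riskier
    t = rng.uniform(0.5, 1.5)                          # random transition time
    return (1 - state, r, t)

def safe_policy(state, rng):
    return 0  # always pick the zero-variance action
```

A stochastic policy search method such as simultaneous perturbation would then treat `vp_score` as the noisy objective to be maximized over the policy's parameters.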