The variance-penalized metric in Markov decision processes (MDPs) seeks to maximize the long-run average reward minus a scalar multiple of the variance of rewards. In this paper, our goal is to study the same metric in the context of the semi-Markov decision process (SMDP). In the SMDP, unlike the MDP, the time spent in each transition is not identical and may in fact be a random variable. We first develop an expression for the variance of rewards in SMDPs, and then formulate the variance-penalized SMDP (VP-SMDP). Our interest here is in solving the problem without generating the underlying transition probabilities of the Markov chains. We propose the use of two stochastic search techniques, namely simultaneous perturbation and learning automata, to solve the problem; these techniques use stochastic policies and can be used within simulators, thereby avoiding the generation of the transition probabilities.
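To illustrate the ideas in the abstract, the sketch below estimates a variance-penalized objective (reward rate minus a scalar times the reward variance) for a stochastic policy in a toy two-state SMDP simulator, and applies one simultaneous-perturbation ascent step. The chain, rewards, sojourn-time distributions, and all function names here are hypothetical illustrations, not the paper's actual model or algorithm.

```python
import random

def simulate_smdp(policy, steps=20000, seed=0):
    """Simulate a toy 2-state SMDP under a stochastic policy and return
    (average reward per unit time, variance of transition rewards).
    The transition structure below is purely illustrative."""
    rng = random.Random(seed)
    # P[state][action] = (prob. of moving to state 0, reward, mean sojourn time)
    P = {0: {0: (0.7, 6.0, 1.0), 1: (0.9, 10.0, 2.0)},
         1: {0: (0.4, -5.0, 1.5), 1: (0.2, 12.0, 3.0)}}
    s = 0
    total_r = total_t = 0.0
    rewards = []
    for _ in range(steps):
        # policy[s] is the probability of choosing action 0 in state s
        a = 0 if rng.random() < policy[s] else 1
        p0, r, tau_mean = P[s][a]
        tau = rng.expovariate(1.0 / tau_mean)  # random transition time (SMDP)
        rewards.append(r)
        total_r += r
        total_t += tau
        s = 0 if rng.random() < p0 else 1
    mean_r = total_r / len(rewards)
    var_r = sum((x - mean_r) ** 2 for x in rewards) / len(rewards)
    return total_r / total_t, var_r

def vp_objective(policy, theta=0.1):
    """Variance-penalized score: reward rate minus theta * reward variance."""
    rate, var = simulate_smdp(policy)
    return rate - theta * var

def spsa_step(policy, theta=0.1, c=0.05, a=0.01, seed=1):
    """One simultaneous-perturbation update: perturb all policy parameters
    at once with random +/-c, estimate the gradient of the VP objective
    from two simulations, and take a projected ascent step."""
    rng = random.Random(seed)
    states = sorted(policy)
    delta = {s: rng.choice([-1.0, 1.0]) for s in states}
    plus = {s: policy[s] + c * delta[s] for s in states}
    minus = {s: policy[s] - c * delta[s] for s in states}
    g = (vp_objective(plus, theta) - vp_objective(minus, theta)) / (2 * c)
    # ghat_i = g / delta_i = g * delta_i since delta_i is +/-1
    return {s: min(0.95, max(0.05, policy[s] + a * g * delta[s]))
            for s in states}

score = vp_objective({0: 0.8, 1: 0.3})
updated = spsa_step({0: 0.8, 1: 0.3})
```

Note that the objective is estimated entirely from simulated trajectories: neither `vp_objective` nor `spsa_step` ever touches the transition-probability matrix directly, which is the model-free property the abstract emphasizes.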