Markov Decision Processes: Discrete Stochastic Dynamic Programming
Introduction to Reinforcement Learning
Neuro-Dynamic Programming
The Relations Among Potentials, Perturbation Analysis, and Markov Decision Processes
Discrete Event Dynamic Systems
From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning
Discrete Event Dynamic Systems
The optimal robust control policy for uncertain semi-Markov control processes
International Journal of Systems Science
Automatica (Journal of IFAC)
Cao's work shows that, by defining an α-dependent equivalent infinitesimal generator Aα, a semi-Markov decision process (SMDP) with either the average- or the discounted-cost criterion can be treated as an α-equivalent Markov decision process (MDP), and that the performance potential theory can likewise be developed for SMDPs. In this work, we focus on establishing error bounds for potential- and Aα-based iterative optimization methods. First, we introduce an α-uniformized Markov chain (UMC) for an SMDP via Aα and a uniformization parameter, and establish the relations between the two processes. In particular, we show that their performance potentials, as solutions of the corresponding Poisson equations, are proportional, so that potential-based studies of an SMDP and of its α-UMC are unified. Using these relations, we derive error bounds for a potential-based policy-iteration algorithm and for a value-iteration algorithm in the presence of various computation errors. The results apply directly to the special cases of continuous-time MDPs and Markov chains, and extend to simulation-based optimization methods such as reinforcement learning and neuro-dynamic programming, where estimation and approximation errors are common. Finally, we present an application example on the look-ahead control of a conveyor-serviced production station (CSPS) and derive the corresponding error bounds.
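As background for the uniformization step in the abstract, the following is a minimal numerical sketch of the generic average-cost construction: a continuous-time generator A is embedded into a discrete-time chain P = I + A/λ, the potentials g are obtained from the Poisson equation (I − P)g + η·1 = f, and one policy-improvement sweep then uses g. All model data here (two states, two actions, the generators, cost rates, and λ) are invented for illustration, and this plain uniformization merely stands in for the paper's Aα-based α-UMC, which additionally covers the discounted criterion and general SMDP sojourn-time distributions.

```python
import numpy as np

# Toy 2-state, 2-action continuous-time MDP (all numbers invented).
# A[a]: infinitesimal generator under action a (rows sum to zero).
# c[a]: cost rates under action a.
A = {0: np.array([[-3.0, 3.0], [2.0, -2.0]]),
     1: np.array([[-1.0, 1.0], [4.0, -4.0]])}
c = {0: np.array([2.0, 1.0]),
     1: np.array([1.0, 3.0])}
lam = 5.0  # uniformization rate, chosen >= every exit rate


def uniformize(gen):
    """Embed a generator into a DTMC transition matrix: P = I + gen/lam."""
    return np.eye(gen.shape[0]) + gen / lam


def evaluate(policy):
    """Average cost eta and potential g (normalized by g[0] = 0) of a
    stationary policy, from the Poisson equation (I - P) g + eta*1 = f."""
    n = len(policy)
    P = np.array([uniformize(A[a])[s] for s, a in enumerate(policy)])
    f = np.array([c[a][s] for s, a in enumerate(policy)])
    # Unknowns are (g[1], ..., g[n-1], eta); g[0] is pinned to 0.
    M = np.column_stack([(np.eye(n) - P)[:, 1:], np.ones(n)])
    x = np.linalg.solve(M, f)
    g = np.concatenate([[0.0], x[:-1]])
    return x[-1], g


def improve(g):
    """One policy-improvement sweep: minimize f_a + P_a g in each state."""
    return [min((0, 1), key=lambda a: c[a][s] + uniformize(A[a])[s] @ g)
            for s in range(2)]


policy = [0, 0]
for _ in range(10):  # potential-based policy iteration
    eta, g = evaluate(policy)
    new = improve(g)
    if new == policy:
        break
    policy = new
```

In the exact-arithmetic setting this loop converges to a stable policy; the error bounds developed in the paper concern precisely how the fixed point degrades when `evaluate` (potential estimation) or `improve` (the minimization) is carried out with computation, estimation, or approximation errors.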