Ergodic control of discrete-time controlled Markov chains with a locally compact state space and a compact action space is considered under suitable stability, irreducibility, and Feller continuity conditions. A flexible family of controls, called action time sharing (ATS) policies, associated with a given continuous stationary Markov control, is introduced. It is shown that the long-term average cost for such a control policy, for a broad range of one-stage cost functions, is the same as that for the associated stationary Markov policy. In addition, ATS policies are well suited to a range of estimation, information collection, and adaptive control goals. To illustrate the possibilities, we present two examples. The first demonstrates the construction of an ATS policy that yields consistent estimators for unknown model parameters while achieving the desired long-term average cost value. The second considers a setting where the target stationary Markov control $q$ is unknown, but sampling schemes are available that allow for consistent estimation of $q$. We construct an ATS policy that uses dynamic estimators of $q$ for control decisions and show that the associated cost coincides with that of the unknown Markov control $q$.
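To make the time-sharing idea concrete, here is a minimal simulation sketch in Python. The finite chain, the cost function, and the $n^{-1/2}$ exploration schedule are all illustrative assumptions, not the paper's construction: on a vanishing fraction of the visits to each state the policy deviates from the target control $q$ for information collection, and the empirical long-run average cost still matches that of $q$.

import numpy as np

rng = np.random.default_rng(1)

# Toy finite model (assumed for illustration): 3 states, 2 actions.
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = law of the next state
c = rng.random((n_states, n_actions))                             # one-stage cost c(s, a)
q = rng.dirichlet(np.ones(n_actions), size=n_states)              # target stationary Markov control q(.|s)

def average_cost(T, ats):
    """Simulate T steps and return the empirical average cost.

    With ats=True, an ATS-style policy is used: at the n-th visit to a
    state it deviates with probability n**-0.5 (an assumed schedule whose
    deviation times have density zero) to a uniform exploration action;
    otherwise it samples the action from q(.|s).
    """
    s, total, visits = 0, 0.0, np.zeros(n_states)
    for _ in range(T):
        visits[s] += 1
        if ats and rng.random() < visits[s] ** -0.5:
            a = rng.integers(n_actions)             # density-zero deviation for estimation
        else:
            a = int(rng.choice(n_actions, p=q[s]))  # follow the target control
        total += c[s, a]
        s = int(rng.choice(n_states, p=P[s, a]))
    return total / T

# For large T the two values should be close, mirroring the result that
# ATS deviations leave the long-term average cost unchanged.
print(average_cost(200_000, ats=False), average_cost(200_000, ats=True))

For a long horizon the two printed averages agree closely: among n visits to a state the deviations occur roughly 2*sqrt(n) times, so their long-run frequency is zero and the state-action frequencies under the ATS policy converge to those induced by $q$.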