This paper studies a discrete-time total-reward Markov decision process (MDP) with a given initial state distribution. A (randomized) stationary policy can be split on a given set of states if the occupancy measure of this policy can be expressed as a convex combination of the occupancy measures of stationary policies, each of which selects deterministic actions on the given set and coincides with the original stationary policy outside of this set. Necessary and sufficient conditions are provided for splitting a stationary policy at a single state, as well as sufficient conditions for splitting it on the whole state space. These results are applied to constrained MDPs and refined for absorbing (including discounted) MDPs with finite state and action spaces. In particular, the paper provides an efficient algorithm that represents the occupancy measure of a given policy as a convex combination of the occupancy measures of finitely many (stationary) deterministic policies; the algorithm generates the splitting policies so that each pair of consecutive policies differs at exactly one state. The results are applied to constrained problems to compute an optimal policy efficiently by computing and splitting a stationary optimal policy.
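The splitting idea described above can be illustrated numerically. The sketch below builds a toy two-state discounted MDP (the states, actions, transition probabilities, and discount factor are illustrative assumptions, not taken from the paper), computes discounted occupancy measures by solving the linear system m = mu + gamma * P_pi^T * m, and checks that the occupancy measure of a policy randomizing at a single state is a convex combination of the occupancy measures of the two deterministic policies that agree with it everywhere else:

```python
import numpy as np

# Toy discounted MDP: states {0, 1}, actions {a=0, b=1}.
# All numbers here are illustrative assumptions, not from the paper.
gamma = 0.9
mu = np.array([1.0, 0.0])          # initial distribution: start in state 0

# P[s, action] -> next-state distribution
P = np.zeros((2, 2, 2))
P[0, 0] = [0.0, 1.0]               # state 0, action a: go to state 1
P[0, 1] = [1.0, 0.0]               # state 0, action b: stay in state 0
P[1, 0] = [1.0, 0.0]               # state 1, action a: go to state 0
P[1, 1] = [1.0, 0.0]               # state 1, action b: go to state 0

def occupancy(pi):
    """Discounted occupancy measure q(s, a) of a stationary policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)                # transitions under pi
    m = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)  # m = mu + gamma P_pi^T m
    return pi * m[:, None]                               # q(s, a) = m(s) pi(a|s)

# Randomized policy: mixes the two actions at state 0, deterministic at state 1.
pi = np.array([[0.5, 0.5], [1.0, 0.0]])
# The two splitting policies: each differs from pi only at state 0,
# where it selects a single deterministic action.
d_a = np.array([[1.0, 0.0], [1.0, 0.0]])
d_b = np.array([[0.0, 1.0], [1.0, 0.0]])

q, q_a, q_b = occupancy(pi), occupancy(d_a), occupancy(d_b)

# Recover the splitting weight from the occupancy measures themselves
# (d_b never plays action a at state 0, so coordinate (0, a) pins down alpha).
alpha = q[0, 0] / q_a[0, 0]
assert np.allclose(q, alpha * q_a + (1 - alpha) * q_b)
print(alpha)
```

A point worth noting in this example: the splitting weight alpha is roughly 0.655, not the action probability pi(a|0) = 0.5 — the convex weights are determined by the occupancy measures of the splitting policies, not by the randomization probabilities alone.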