On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies

Authors:
Shie Mannor; John N. Tsitsiklis
Affiliations:
Department of Electrical and Computer Engineering, McGill University, 3480 University Street, Montreal, Québec, Canada H3A 2A7; Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Venue:
Mathematics of Operations Research
Year:
2005

Abstract

We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
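
The abstract does not spell out the polytope. The sketch below assumes it is the usual set of stationary state-action frequencies, i.e. nonnegative vectors x that sum to one and satisfy the flow-balance equations sum_a x(s', a) = sum_{s, a} x(s, a) P(s' | s, a). The two-state MDP, the transition table `P`, the policy `time_varying_policy`, and the helper names are all hypothetical and chosen only for illustration. The balance residual it prints is a rough proxy for the distance to the polytope, not the distance itself.

```python
import random

# Hypothetical 2-state, 2-action MDP (not from the paper).
# P[s][a][s2] is the probability of moving from state s to s2 under action a.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
N_STATES, N_ACTIONS = 2, 2


def empirical_frequencies(policy, horizon, seed=0):
    """Run `policy` for `horizon` steps; return the fraction of steps spent in each (state, action) pair."""
    rng = random.Random(seed)
    counts = [[0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for t in range(horizon):
        action = policy(state, t, rng)
        counts[state][action] += 1
        # Sample the next state from P[state][action].
        state = 0 if rng.random() < P[state][action][0] else 1
    return [[c / horizon for c in row] for row in counts]


def balance_residual(freq):
    """Largest violation of the assumed flow-balance equations over all states."""
    worst = 0.0
    for s2 in range(N_STATES):
        occupancy = sum(freq[s2])
        inflow = sum(
            freq[s][a] * P[s][a][s2]
            for s in range(N_STATES)
            for a in range(N_ACTIONS)
        )
        worst = max(worst, abs(occupancy - inflow))
    return worst


def time_varying_policy(state, t, rng):
    # A non-stationary, randomized policy: the action bias flips every 100 steps.
    bias = 0.8 if (t // 100) % 2 == 0 else 0.2
    return 0 if rng.random() < bias else 1


for horizon in (100, 1_000, 10_000, 100_000):
    x = empirical_frequencies(time_varying_policy, horizon)
    print(horizon, round(balance_residual(x), 4))
```

Even though the policy is not stationary, the residual should shrink as the horizon grows, which is the qualitative behavior the abstract describes for every policy. The sketch does not attempt to estimate the exponential decay rate.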