Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation.
Mathematics of Operations Research.
Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM.
Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes. Discrete Event Dynamic Systems.
PEGASUS: A policy search method for large MDPs and POMDPs. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI '00).
Bounds on Sample Size for Policy Evaluation in Markov Environments. Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory (COLT '01/EuroCOLT '01).
Simulation-based Uniform Value Function Estimates of Markov Decision Processes. SIAM Journal on Control and Optimization.
On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies. Mathematics of Operations Research.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Neural Network Learning: Theoretical Foundations.
Stochastic Learning and Optimization: A Sensitivity-Based Approach.
Learning and Generalization: With Applications to Neural Networks.
We generalize and build on the PAC learning framework for Markov decision processes developed in Jain and Varaiya (2006). We allow the reward function to depend on both the state and the action, and both the state and action spaces may be countably infinite. We estimate the value function of a Markov decision process, which assigns to each policy its expected discounted reward, by taking the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework for obtaining an ε-optimal policy from simulation, and provide the sample complexity of this approach.
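As a rough illustration of the simulation-based estimate described in the abstract, here is a minimal Python sketch. It assumes a user-supplied single-step simulator env_step(state, action) -> (next_state, reward), policies given as callables state -> action, and truncation of each run at a finite horizon; these names and the truncation scheme are hypothetical stand-ins, not constructs from the paper.

    def estimate_value(policy, env_step, init_state, gamma, n_runs, horizon):
        # Empirical average of the truncated discounted reward over
        # n_runs independent simulation runs of the given policy.
        total = 0.0
        for _ in range(n_runs):
            state, discount, ret = init_state, 1.0, 0.0
            for _ in range(horizon):
                action = policy(state)
                state, reward = env_step(state, action)  # one simulated transition
                ret += discount * reward
                discount *= gamma
            total += ret
        return total / n_runs

    def select_policy(policies, env_step, init_state, gamma, n_runs, horizon):
        # Return the policy in the class with the largest empirical value estimate.
        return max(policies,
                   key=lambda p: estimate_value(p, env_step, init_state,
                                                gamma, n_runs, horizon))

Truncating at horizon T biases each run by at most gamma^T * R_max / (1 - gamma) when rewards are bounded by R_max. A uniform convergence bound over the policy class, of the kind the paper derives in terms of its VC or pseudo-dimension, then guarantees that with enough runs every estimate is within ε/2 of the true value, so the empirical maximizer returned by select_policy is ε-optimal within the class with high probability.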