CONVERGENCE OF SIMULATION-BASED POLICY ITERATION

Authors:
William L. Cooper; Shane G. Henderson; Mark E. Lewis
Affiliations:
Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN 55455, E-mail: billcoop@me.umn.edu
School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY 14853, E-mail: shane@orie.cornell.edu
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109-2117, E-mail: melewis@engin.umich.edu
Venue:
Probability in the Engineering and Informational Sciences
Year:
2003

Abstract

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI uses simulation to estimate a solution to them. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions ensuring that, almost surely, SBPI eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee almost-sure convergence of the algorithm.
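
The idea in the abstract can be illustrated with a minimal Python sketch. It is not claimed to match any of the three estimators analyzed in the paper, nor the paper's run-length conditions. It assumes a hypothetical two-state, two-action recurrent MDP with made-up numbers (`P`, `R`), a regenerative estimator built on returns to a reference state `REF`, an improvement step that uses the known model, and a run length that simply grows linearly with the iteration index. All names here are illustrative.

```python
import random

# Hypothetical 2-state, 2-action recurrent MDP (illustrative numbers only).
# P[s][a] = transition probabilities to states (0, 1); R[s][a] = one-step reward.
P = {0: {0: [0.9, 0.1], 1: [0.1, 0.9]},
     1: {0: [0.1, 0.9], 1: [0.9, 0.1]}}
R = {0: {0: 0.0, 1: -1.0},
     1: {0: 5.0, 1: 6.0}}
STATES, ACTIONS, REF = (0, 1), (0, 1), 0  # REF is the regeneration state


def step(rng, s, a):
    """Sample the next state from P[s][a]."""
    return rng.choices(STATES, weights=P[s][a])[0]


def run_to_ref(rng, policy, start):
    """Simulate from `start` until the chain next enters REF.

    Returns (total reward, number of steps) collected along the way.
    """
    s, total, steps = start, 0.0, 0
    while True:
        a = policy[s]
        total += R[s][a]
        steps += 1
        s = step(rng, s, a)
        if s == REF:
            return total, steps


def evaluate(rng, policy, n):
    """Estimate the gain g and relative values h (with h[REF] = 0) from n paths per state."""
    cycles = [run_to_ref(rng, policy, REF) for _ in range(n)]
    g = sum(c[0] for c in cycles) / sum(c[1] for c in cycles)  # ratio estimator of the gain
    h = {REF: 0.0}
    for s in STATES:
        if s != REF:
            paths = [run_to_ref(rng, policy, s) for _ in range(n)]
            # Average of (reward - g) accumulated until the first visit to REF.
            h[s] = sum(tot - g * k for tot, k in paths) / n
    return g, h


def improve(h):
    """Greedy improvement step using the estimated relative values."""
    def score(s, a):
        return R[s][a] + sum(p * h[j] for j, p in zip(STATES, P[s][a]))
    return {s: max(ACTIONS, key=lambda a: score(s, a)) for s in STATES}


def sbpi(policy, iterations=5, base_runs=200, seed=0):
    """Run simulation-based policy iteration; returns (policy, last gain estimate)."""
    rng = random.Random(seed)
    g = None
    for k in range(iterations):
        g, h = evaluate(rng, policy, base_runs * (k + 1))  # run length grows with k
        policy = improve(h)
    return policy, g
```

Calling `sbpi({0: 0, 1: 1})` starts from the worst of the four deterministic policies in this toy instance. An exact stationary-distribution calculation gives gain 4.4 for the policy `{0: 1, 1: 0}`, the largest of the four, so with run lengths of this size the iterates should settle on that policy and the returned gain estimate should be close to 4.4.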