We consider Markov decision processes (MDPs) with unknown transition probabilities and unknown single-period expected cost functions, and we study a method for estimating these quantities from historical or simulated data. The method requires knowledge of the system equations that govern state transitions and of the single-period cost functions, but not of the single-period expected cost functions. The estimation procedure takes expectations with respect to the empirical distribution functions of the data. Once the estimates are in place, the method computes a policy by solving the resulting “empirical” MDP as if the estimates were correct. For MDPs that satisfy certain conditions, we provide explicit, easily computed expressions for the probability that the procedure produces a policy whose true expected cost is within any specified absolute distance of the optimal expected cost of the true MDP. We also provide expressions for the number of historical or simulated data values that suffices for the procedure to produce a policy whose true expected cost is, with a prescribed probability, within a prescribed absolute distance of the optimal expected cost of the true MDP. We apply our results to multiperiod inventory models. In addition, we provide a specialized analysis of such inventory models that yields relative, rather than absolute, accuracy guarantees. We make comparisons with related results that have recently appeared, and we provide numerical examples.
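The idea can be sketched in a few lines of code. The toy MDP below (states, actions, horizon, and a noise level for cost observations) is a hypothetical stand-in, not taken from the paper; the sketch builds the empirical transition kernel and empirical expected costs from simulated samples, solves the empirical MDP by backward induction as if the estimates were exact, and then evaluates the resulting policy under the true model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: the "true" kernel and expected costs below stand in
# for the unknown quantities that the method must estimate from data.
n_states, n_actions, horizon = 3, 2, 5
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
true_cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def empirical_mdp(n_samples):
    """Estimate transitions and expected costs by empirical averages
    (expectations under the empirical distribution of the samples)."""
    P_hat = np.zeros_like(true_P)
    c_hat = np.zeros_like(true_cost)
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=n_samples, p=true_P[s, a])
            P_hat[s, a] = np.bincount(nxt, minlength=n_states) / n_samples
            # Noisy single-period cost observations around the true mean.
            obs = true_cost[s, a] + rng.normal(0.0, 0.1, n_samples)
            c_hat[s, a] = obs.mean()
    return P_hat, c_hat

def solve_finite_horizon(P, c):
    """Backward induction on a finite-horizon MDP with kernel P, costs c."""
    V = np.zeros(n_states)
    stage_policies = []
    for _ in range(horizon):
        Q = c + P @ V                      # Q[s, a] = c[s, a] + E[V(next)]
        stage_policies.append(Q.argmin(axis=1))
        V = Q.min(axis=1)
    return V, stage_policies[::-1]         # policies in stage order 0..H-1

def evaluate(stage_policies, P, c):
    """True expected cost of a (possibly suboptimal) stage-dependent policy."""
    V = np.zeros(n_states)
    for pi in reversed(stage_policies):
        idx = np.arange(n_states)
        V = c[idx, pi] + np.einsum('ij,j->i', P[idx, pi], V)
    return V

V_true, _ = solve_finite_horizon(true_P, true_cost)    # optimal under true MDP
P_hat, c_hat = empirical_mdp(n_samples=2000)
_, pi_emp = solve_finite_horizon(P_hat, c_hat)         # policy from empirical MDP
V_emp_true = evaluate(pi_emp, true_P, true_cost)       # its true expected cost
gap = V_emp_true - V_true                              # nonnegative by optimality
```

The quantity `gap` is exactly what the paper's guarantees bound: the paper gives expressions for how many samples suffice so that, with prescribed probability, each entry of `gap` is below a prescribed tolerance.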