Robust Markov Decision Processes

Authors:
Wolfram Wiesemann;Daniel Kuhn;Berç Rustem
Affiliations:
Department of Computing, Imperial College London, London SW7 2AZ, United Kingdom;Department of Computing, Imperial College London, London SW7 2AZ, United Kingdom;Department of Computing, Imperial College London, London SW7 2AZ, United Kingdom
Venue:
Mathematics of Operations Research
Year:
2013

Abstract

Markov decision processes (MDPs) are powerful tools for decision making in uncertain dynamic environments. However, the solutions of MDPs are of limited practical use because of their sensitivity to distributional model parameters, which are typically unknown and have to be estimated by the decision maker. To counter the detrimental effects of estimation errors, we consider robust MDPs that offer probabilistic guarantees in view of the unknown parameters. To this end, we assume that an observation history of the MDP is available. Based on this history, we derive a confidence region that contains the unknown parameters with a prespecified probability 1-β. Afterward, we determine a policy that attains the highest worst-case performance over this confidence region. By construction, this policy achieves or exceeds its worst-case performance with a confidence of at least 1-β. Our method involves the solution of tractable conic programs of moderate size.
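
The pipeline outlined in the abstract (observation history, confidence region, worst-case policy) can be illustrated with a small sketch. This is not the paper's method: the abstract states that the authors solve conic programs, whereas the sketch below swaps in a simpler and commonly used alternative, namely an L1-ball confidence region around each empirical transition distribution combined with robust value iteration. It assumes that the uncertainty is independent across state-action pairs, that rewards are known, and that the horizon is infinite with discount factor `gamma`. All function names, the transition counts, and the rewards are hypothetical. The radius formula is a Weissman-style concentration bound for a single state-action pair with at least two successor states; a joint guarantee over all pairs would need a further correction such as a union bound.

```python
import math


def l1_radius(n, num_states, beta):
    """L1 radius for one empirical distribution built from n samples.

    Assumption: Weissman-style bound, valid for a single (state, action)
    pair and num_states >= 2. It is not the confidence region of the paper.
    """
    return math.sqrt(2.0 * math.log((2 ** num_states - 2) / beta) / n)


def worst_case_expectation(p_hat, v, eps):
    """Minimise sum(p[s] * v[s]) over distributions p with ||p - p_hat||_1 <= eps."""
    p = list(p_hat)
    lo = min(range(len(v)), key=lambda s: v[s])
    # Move up to eps / 2 of probability mass onto the lowest-value state ...
    shift = min(eps / 2.0, 1.0 - p[lo])
    p[lo] += shift
    # ... and take the same amount away from the highest-value states first.
    for s in sorted(range(len(v)), key=lambda s: v[s], reverse=True):
        if shift <= 0.0:
            break
        if s == lo:
            continue
        take = min(p[s], shift)
        p[s] -= take
        shift -= take
    return sum(ps * vs for ps, vs in zip(p, v))


def robust_value_iteration(p_hat, rewards, eps, gamma=0.9, iters=500):
    """Return worst-case values and a greedy policy.

    p_hat[s][a] is an estimated next-state distribution and eps[s][a] is
    the L1 radius of its confidence region.
    """
    num_states = len(p_hat)

    def q(s, a, v):
        return rewards[s][a] + gamma * worst_case_expectation(p_hat[s][a], v, eps[s][a])

    v = [0.0] * num_states
    for _ in range(iters):
        v = [max(q(s, a, v) for a in range(len(p_hat[s]))) for s in range(num_states)]
    policy = [max(range(len(p_hat[s])), key=lambda a: q(s, a, v)) for s in range(num_states)]
    return v, policy


# Hypothetical observation history: counts[s][a][t] is the number of
# observed transitions from state s to state t under action a.
counts = [[[8, 2], [5, 5]], [[3, 7], [1, 9]]]
rewards = [[0.0, 0.2], [1.0, 0.8]]  # hypothetical known rewards
beta = 0.05

p_hat = [[[c / sum(row) for c in row] for row in state] for state in counts]
eps = [[l1_radius(sum(row), len(row), beta) for row in state] for state in counts]

robust_values, robust_policy = robust_value_iteration(p_hat, rewards, eps)
print(robust_values, robust_policy)
```

Setting every radius in `eps` to zero recovers ordinary value iteration on the estimated model, so comparing the two runs shows how much value the decision maker gives up in exchange for the 1-β guarantee under these assumptions.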