Markov Decision Processes (MDPs) provide a mathematical framework for modelling the decision-making of agents acting in stochastic environments, in which transition probabilities model the environment dynamics and a reward function evaluates the agent's behaviour. Recently, however, special attention has been paid to the difficulty of modelling the reward function precisely, which has motivated research on MDPs with imprecisely specified rewards. Some of this work exploits nondominated policies, that is, policies that are optimal for some instantiation of the imprecise reward function. πWitness is an algorithm that computes nondominated policies, and nondominated policies are used to make decisions under the minimax regret criterion. An interesting question is how to select a small subset of nondominated policies so that the minimax regret can be computed faster while remaining accurate. We modify πWitness to do so. We also present the πHull algorithm, which computes nondominated policies using a geometric approach. Under the assumption that reward functions are linear in a set of features, we show empirically that πHull can be faster than our modified version of πWitness.
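To make the minimax regret criterion concrete, the following Python sketch (ours, not the paper's implementation) evaluates it over a finite set of policies under the abstract's linearity assumption: the reward is r(s, a) = w · φ(s, a), each policy is summarised by its expected discounted feature counts, and the unknown weight vector w is assumed, for simplicity, to lie in a box. The policy set, feature counts, and box bounds below are hypothetical.

import numpy as np

def pairwise_max_regret(mu_pi, mu_adv, lo, hi):
    # Max over w in the box [lo, hi] of w . (mu_adv - mu_pi), i.e. how much
    # an adversarially chosen reward can make mu_adv look better than mu_pi.
    # Over a box this decomposes coordinate-wise: pick hi where the feature
    # difference is positive, lo where it is negative.
    diff = mu_adv - mu_pi
    return float(np.sum(np.where(diff > 0, hi * diff, lo * diff)))

def minimax_regret(feature_counts, lo, hi):
    # Minimax regret restricted to a finite policy set.
    # feature_counts: (n_policies, n_features) array, one row per policy
    # (e.g. a set of nondominated policies). Returns the index of the
    # policy minimising its worst-case regret, and that regret value.
    n = len(feature_counts)
    max_regrets = [
        max(pairwise_max_regret(feature_counts[i], feature_counts[j], lo, hi)
            for j in range(n))
        for i in range(n)
    ]
    best = int(np.argmin(max_regrets))
    return best, max_regrets[best]

# Toy usage: three policies over two reward features, w in [0, 1]^2.
mus = np.array([[1.0, 0.2], [0.3, 1.0], [0.6, 0.6]])
best, regret = minimax_regret(mus, lo=np.zeros(2), hi=np.ones(2))
print(f"policy {best} achieves minimax regret {regret:.2f}")

Swapping the two maximisations (over the adversary policy and over w) is valid here because both sets are fixed; with a general polytope of feasible reward weights instead of a box, the inner maximisation would be solved as a linear program rather than coordinate-wise. Restricting the policy set to a small subset of nondominated policies, as the abstract proposes, shrinks the quadratic number of pairwise regret computations this evaluation requires.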