To tackle the potentially hard task of defining the reward function of a Markov Decision Process, we propose a new approach, based on Value Iteration, which interweaves the elicitation and optimization phases. We assume that rewards whose numeric values are unknown can only be ordered, and that a tutor is available to compare sequences of rewards. We first show how the set of reward functions compatible with a given preference relation can be represented as a polytope. Our algorithm, called Interactive Value Iteration, then searches for an optimal policy while refining its knowledge of the possible reward functions, querying the tutor only when necessary. We prove that the number of queries needed before finding an optimal policy is upper-bounded by a polynomial in the size of the problem, and we present experimental results demonstrating that our approach is efficient in practice.
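The scheme described above can be illustrated with a small sketch. This is not the authors' exact algorithm: the MDP, the reward labels, and the comparison logic below are all illustrative assumptions. Values are kept as vectors counting discounted occurrences of each symbolic reward label, so a comparison between two candidate values is linear in the unknown reward vector; when a cheap componentwise dominance test (valid here because we assume nonnegative rewards) cannot decide the comparison, the tutor is queried.

```python
# Toy sketch of interactive value iteration (illustrative, not the paper's
# exact method). Rewards are symbolic labels 0..K-1 with unknown numeric
# values; state values are vectors of discounted label counts, so the true
# scalar value of v is the dot product w . v for the hidden reward vector w.

GAMMA = 0.9
K = 3  # number of distinct reward labels (assumed for this toy example)

# transitions[s][a] = list of (probability, next_state)
transitions = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
# reward_label[s][a] = symbolic label earned by taking a in s
reward_label = {0: {0: 0, 1: 1}, 1: {0: 2, 1: 1}}

true_w = [0.0, 1.0, 0.5]  # hidden numeric rewards, known only to the tutor
queries = []              # log of tutor queries, to count elicitation cost

def tutor_prefers(u, v):
    """Tutor answers a comparison query using the hidden reward values."""
    queries.append((tuple(u), tuple(v)))
    return sum(w * (a - b) for w, a, b in zip(true_w, u, v)) >= 0

def dominates(u, v):
    """Sufficient test needing no query: componentwise dominance
    (sound here since all rewards are assumed nonnegative)."""
    return all(a >= b for a, b in zip(u, v))

def better(u, v):
    if dominates(u, v):
        return True
    if dominates(v, u):
        return False
    return tutor_prefers(u, v)  # ambiguous: fall back on the tutor

def interactive_value_iteration(n_iter=50):
    V = {s: [0.0] * K for s in transitions}
    for _ in range(n_iter):
        newV = {}
        for s, acts in transitions.items():
            best = None
            for a, succ in acts.items():
                q = [0.0] * K
                q[reward_label[s][a]] += 1.0  # earn this action's label
                for p, s2 in succ:
                    for k in range(K):
                        q[k] += GAMMA * p * V[s2][k]
                if best is None or better(q, best):
                    best = q
            newV[s] = best
        V = newV
    return V
```

In the full algorithm, each tutor answer becomes a linear constraint on the polytope of admissible reward functions, so later comparisons can often be resolved without a new query; this sketch omits that caching and simply asks the tutor whenever dominance fails.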