To tackle the potentially hard task of defining the reward function of a Markov Decision Process, we propose a new approach, based on Value Iteration, which interweaves the elicitation and optimization phases. We assume that rewards whose numeric values are unknown can only be ordered, and that a tutor is available to compare sequences of rewards. We first show how the set of reward functions compatible with a given preference relation can be represented as a polytope. Our algorithm, called Interactive Value Iteration, then searches for an optimal policy while refining its knowledge of the possible reward functions, querying the tutor only when necessary. We prove that the number of queries needed before finding an optimal policy is upper-bounded by a polynomial in the size of the problem, and we present experimental results demonstrating that our approach is efficient in practice.
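The scheme described above can be illustrated with a small sketch. This is not the authors' exact algorithm: the MDP, the reward labels, and the comparison logic below are all illustrative assumptions. Values are kept as vectors counting discounted occurrences of each symbolic reward label, so a comparison between two candidate values is linear in the unknown reward vector; when a cheap componentwise dominance test (valid here because we assume nonnegative rewards) cannot decide the comparison, the tutor is queried.

```python
# Toy sketch of interactive value iteration (illustrative, not the paper's
# exact method). Rewards are symbolic labels 0..K-1 with unknown numeric
# values; state values are vectors of discounted label counts, so the true
# scalar value of v is the dot product w . v for the hidden reward vector w.

GAMMA = 0.9
K = 3  # number of distinct reward labels (assumed for this toy example)

# transitions[s][a] = list of (probability, next_state)
transitions = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
# reward_label[s][a] = symbolic label earned by taking a in s
reward_label = {0: {0: 0, 1: 1}, 1: {0: 2, 1: 1}}

true_w = [0.0, 1.0, 0.5]  # hidden numeric rewards, known only to the tutor
queries = []              # log of tutor queries, to count elicitation cost

def tutor_prefers(u, v):
    """Tutor answers a comparison query using the hidden reward values."""
    queries.append((tuple(u), tuple(v)))
    return sum(w * (a - b) for w, a, b in zip(true_w, u, v)) >= 0

def dominates(u, v):
    """Sufficient test needing no query: componentwise dominance
    (sound here since all rewards are assumed nonnegative)."""
    return all(a >= b for a, b in zip(u, v))

def better(u, v):
    if dominates(u, v):
        return True
    if dominates(v, u):
        return False
    return tutor_prefers(u, v)  # ambiguous: fall back on the tutor

def interactive_value_iteration(n_iter=50):
    V = {s: [0.0] * K for s in transitions}
    for _ in range(n_iter):
        newV = {}
        for s, acts in transitions.items():
            best = None
            for a, succ in acts.items():
                q = [0.0] * K
                q[reward_label[s][a]] += 1.0  # earn this action's label
                for p, s2 in succ:
                    for k in range(K):
                        q[k] += GAMMA * p * V[s2][k]
                if best is None or better(q, best):
                    best = q
            newV[s] = best
        V = newV
    return V
```

In the full algorithm, each tutor answer becomes a linear constraint on the polytope of admissible reward functions, so later comparisons can often be resolved without a new query; this sketch omits that caching and simply asks the tutor whenever dominance fails.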