Online learning in Markov decision processes with arbitrarily changing rewards and transitions

  • Authors: Jia Yuan Yu; Shie Mannor
  • Affiliations: Department of Electrical and Computer Engineering, McGill University; Department of Electrical and Computer Engineering, McGill University, and Department of Electrical Engineering, Technion
  • Venue: GameNets'09 Proceedings of the First ICST international conference on Game Theory for Networks
  • Year: 2009

Abstract

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., non-stationary) fashion. We present algorithms that combine online learning and robust control, and establish guarantees on their performance evaluated in retrospect against alternative policies, i.e., their regret. These guarantees depend critically on the range of uncertainty in the transition probabilities, but hold regardless of the changes in rewards and transition probabilities over time. We present a version of the main algorithm in the setting where the decision-maker's observations are limited to its trajectory, and another version that allows a trade-off between performance and computational complexity.
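To make the notion of regret against alternative policies concrete, the following is a minimal sketch (not the paper's algorithm) of the classical exponential-weights forecaster over a finite set of candidate policies, with per-round rewards that may change arbitrarily. It deliberately ignores transition dynamics, which is the part the paper's robust-control machinery handles; the function name and parameters are illustrative assumptions.

```python
import math

def exp_weights_regret(reward_seq, eta=0.1):
    """Run exponential weights over a fixed set of candidate policies.

    reward_seq: list of per-round reward vectors in [0, 1], one entry per
    policy; rewards may change arbitrarily between rounds (a simplified
    version of the paper's setting, with transition dynamics ignored).
    Returns (algorithm's expected cumulative reward, best fixed policy's
    cumulative reward); their difference is the regret in hindsight.
    """
    n = len(reward_seq[0])
    weights = [1.0] * n          # one weight per candidate policy
    alg_total = 0.0              # algorithm's expected cumulative reward
    cum = [0.0] * n              # cumulative reward of each fixed policy
    for r in reward_seq:
        z = sum(weights)
        probs = [w / z for w in weights]
        alg_total += sum(p * ri for p, ri in zip(probs, r))
        for i in range(n):
            cum[i] += r[i]
            weights[i] *= math.exp(eta * r[i])  # multiplicative update
    return alg_total, max(cum)
```

Even when the reward vectors change adversarially from round to round, this scheme's cumulative reward trails the best fixed policy in hindsight by only O(sqrt(T log n)), which is the flavor of guarantee the abstract refers to as regret.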