We consider a learning problem in which the decision maker interacts with a standard Markov decision process, except that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform, in hindsight, as well as the best stationary policy; this generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm, in the spirit of reinforcement learning, that ensures the agent's average performance loss vanishes over time, provided the environment is oblivious to the agent's actions. Moreover, the basic algorithm can be modified to cope with settings where reward observations are limited to the agent's own trajectory. We present further modifications that reduce the computational cost via function approximation and that track the optimal policy through infrequent changes.
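To make the guarantee precise, here is one standard way to formalize the vanishing average performance loss claimed above. The notation is ours, not the paper's: r_t denotes the (arbitrarily varying) reward function at step t, (s_t, a_t) the agent's state-action pair, \Pi the set of stationary policies, and s_t^\pi the state sequence induced by following a fixed policy \pi. The regret after T steps against the best stationary policy is

    R_T = \max_{\pi \in \Pi} \mathbb{E}\Big[ \sum_{t=1}^{T} r_t\big(s_t^{\pi}, \pi(s_t^{\pi})\big) \Big] - \mathbb{E}\Big[ \sum_{t=1}^{T} r_t(s_t, a_t) \Big],

and the no-regret (vanishing average loss) property reads

    \lim_{T \to \infty} \frac{R_T}{T} = 0 .

In the special case of a single-state MDP this reduces to the classical no-regret guarantee for repeated games, which is the sense in which the result generalizes it.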