We consider a learning problem in which the decision maker interacts with a standard Markov decision process, except that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform, in hindsight, as well as the best stationary policy; this generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm, in the spirit of reinforcement learning, that ensures the agent's average performance loss vanishes over time, provided the environment is oblivious to the agent's actions. Moreover, the basic algorithm can be modified to cope with settings where reward observations are limited to the agent's own trajectory. We present further modifications that reduce the computational cost via function approximation and that track the optimal policy through infrequent changes.
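To make the guarantee precise, here is one standard way to formalize the vanishing average performance loss claimed above. The notation is ours, not the paper's: r_t denotes the (arbitrarily varying) reward function at step t, (s_t, a_t) the agent's state-action pair, \Pi the set of stationary policies, and s_t^\pi the state sequence induced by following a fixed policy \pi. The regret after T steps against the best stationary policy is

    R_T = \max_{\pi \in \Pi} \mathbb{E}\Big[ \sum_{t=1}^{T} r_t\big(s_t^{\pi}, \pi(s_t^{\pi})\big) \Big] - \mathbb{E}\Big[ \sum_{t=1}^{T} r_t(s_t, a_t) \Big],

and the no-regret (vanishing average loss) property reads

    \lim_{T \to \infty} \frac{R_T}{T} = 0 .

In the special case of a single-state MDP this reduces to the classical no-regret guarantee for repeated games, which is the sense in which the result generalizes it.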