Online regret bounds for Markov decision processes with deterministic transitions

Authors:
Ronald Ortner
Affiliations:
Department Mathematik und Informationstechnologie, Montanuniversität Leoben, A-8700 Leoben, Austria
Venue:
Theoretical Computer Science
Year:
2010

Abstract

We consider an upper confidence bound algorithm for learning in Markov decision processes with deterministic transitions. For this algorithm we derive upper bounds on the online regret with respect to an (ε-)optimal policy that are logarithmic in the number of steps taken. We also present a corresponding lower bound. As an application, multi-armed bandits with switching cost are considered.
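
The following is a minimal illustrative sketch of the general idea behind such an approach, not the paper's algorithm. It rests on assumptions the abstract does not state: rewards lie in [0, 1], the optimism bonus is a Hoeffding-style radius, and the optimistic policy is found as a maximum mean cycle using Karp's algorithm. The names `ucb_edge_rewards` and `max_mean_cycle` are hypothetical.

```python
import math


def ucb_edge_rewards(mean, count, t):
    """Optimistic reward estimate per edge (state-action pair).

    Assumption: rewards in [0, 1] and a Hoeffding-style confidence radius;
    the paper's exact confidence term may differ.
    """
    return {
        e: min(1.0, mean[e] + math.sqrt(math.log(t) / (2 * count[e])))
        for e in mean
    }


def max_mean_cycle(n, weights):
    """Karp's algorithm for the maximum mean weight over all cycles.

    weights maps a directed edge (u, v) to its weight; nodes are 0..n-1.
    Returns None if the graph has no cycle.
    """
    neg_inf = float("-inf")
    # d[k][v] = maximum weight of a walk with exactly k edges ending at v,
    # starting from any node (equivalent to adding a zero-weight super-source).
    d = [[neg_inf] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for (u, v), w in weights.items():
            if d[k - 1][u] > neg_inf:
                d[k][v] = max(d[k][v], d[k - 1][u] + w)

    best = None
    for v in range(n):
        if d[n][v] == neg_inf:
            continue  # no walk of length n ends here, so v gives no bound
        value = min(
            (d[n][v] - d[k][v]) / (n - k)
            for k in range(n)
            if d[k][v] > neg_inf
        )
        best = value if best is None else max(best, value)
    return best


# Toy deterministic MDP with 3 states; each edge is a state-action pair.
edges = {(0, 1): 1.0, (1, 0): 0.0, (1, 2): 0.5, (2, 1): 0.9, (2, 2): 0.6}
best_gain = max_mean_cycle(3, edges)
```

In a deterministic MDP every stationary policy eventually runs around a cycle, so its average reward is that cycle's mean reward. Under the assumptions above, an optimistic learner would feed the output of `ucb_edge_rewards` into `max_mean_cycle` and follow the resulting cycle until its estimates change.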