We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, multi-armed bandits with switching cost are considered.
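As a rough illustration of the optimistic approach described above (a minimal sketch, not the paper's actual algorithm), the snippet below assumes the deterministic transition graph is known and only the mean rewards are unknown. Each transition receives a UCB-style optimistic reward estimate, and the optimistic average-reward policy corresponds to a maximum mean cycle in the resulting weighted graph, whose value is computed here with a Karp-style dynamic program. The names `ucb_index` and `max_mean_cycle_value`, the confidence constant `c`, and the toy data are all illustrative assumptions.

```python
import math

def ucb_index(mean_reward, pulls, t, c=2.0):
    """Optimistic estimate of a transition's mean reward: empirical mean
    plus a confidence bonus that shrinks as the transition is sampled.
    The constant c is an illustrative choice, not taken from the paper."""
    if pulls == 0:
        return float("inf")
    return mean_reward + math.sqrt(c * math.log(t) / pulls)

def max_mean_cycle_value(n, weights):
    """Value of the maximum-mean (average-weight) cycle in a deterministic
    transition graph on states 0..n-1, via a Karp-style dynamic program.

    weights: dict mapping (u, v) -> optimistic reward of transition u -> v.
    Returns -inf if the graph has no cycle.
    """
    NEG = float("-inf")
    # d[k][v] = maximum total weight of a walk with exactly k edges ending at v.
    # Walks may start anywhere (equivalent to adding a 0-weight super-source,
    # which neither creates nor destroys cycles).
    d = [[NEG] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for (u, v), w in weights.items():
            if d[k - 1][u] > NEG:
                d[k][v] = max(d[k][v], d[k - 1][u] + w)

    best = NEG
    for v in range(n):
        if d[n][v] == NEG:
            continue  # no walk of length n ends at v, so v lies on no cycle
        best = max(best, min((d[n][v] - d[k][v]) / (n - k) for k in range(n)))
    return best

# Toy usage: 3 states, optimistic rewards on the known deterministic transitions;
# each entry maps transition -> (empirical mean reward, number of visits).
if __name__ == "__main__":
    t = 100
    stats = {(0, 1): (0.4, 10), (1, 0): (0.7, 10), (1, 2): (0.5, 3), (2, 1): (0.2, 3)}
    optimistic = {e: ucb_index(m, visits, t) for e, (m, visits) in stats.items()}
    print(max_mean_cycle_value(3, optimistic))
```

For the switching-cost bandit application mentioned above, one would (roughly) model each arm as a state whose self-loop pays the arm's reward, while transitions between different states additionally pay the switching cost; this reduction is only sketched here and follows the general idea, not the paper's exact construction.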