For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. To describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy that moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm, which can be used to derive a (gap-dependent) regret bound that is logarithmic in T. Finally, we consider a setting in which the MDP is allowed to change up to ℓ times. We present a modification of our algorithm that handles this setting and show a regret bound of Õ(ℓ^(1/3) T^(2/3) DS√A).
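To make the diameter parameter concrete, the following is a minimal sketch (not from the paper) of how D can be computed for a small known MDP: for each target state, value iteration on minimal expected hitting times h(s) = 1 + min_a Σ_s' P(s'|s,a) h(s'), then the maximum over all ordered state pairs. The 3-state, 2-action transition tensor below is a made-up example for illustration.

```python
import numpy as np

def min_hitting_times(P, target, iters=10_000, tol=1e-10):
    """Expected hitting times to `target` under the time-minimizing policy,
    via value iteration on h(s) = 1 + min_a sum_s' P[s,a,s'] * h(s')."""
    S = P.shape[0]
    h = np.zeros(S)
    for _ in range(iters):
        h_new = 1.0 + (P @ h).min(axis=1)  # (P @ h) has shape (S, A)
        h_new[target] = 0.0                # zero steps needed at the target
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return h

def diameter(P):
    """Diameter D: max over pairs s != s' of the minimal expected
    number of steps to move from s to s'."""
    S = P.shape[0]
    return max(min_hitting_times(P, t)[s]
               for t in range(S) for s in range(S) if s != t)

# Hypothetical 3-state, 2-action MDP; P[s, a, s'] = transition probability.
P = np.array([
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # from state 0
    [[1.0, 0.0, 0.0], [0.5, 0.0, 0.5]],   # from state 1
    [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],   # from state 2
])
print(diameter(P))
```

Note that this fixed-point iteration converges here because every state can reach every other state under some policy, which is exactly the communicating-MDP condition under which the diameter is finite.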