We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes; in each episode it selects a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√AT) after T steps. We also relate the span to various diameter-like quantities associated with the MDP, showing how our results improve on previous regret bounds.
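To make the key quantity in the bound concrete: the span of the optimal bias vector h* is sp(h*) = max_s h*(s) − min_s h*(s), and H is an upper bound on it. Below is a minimal sketch (not the paper's algorithm) that estimates sp(h*) for a small, fully known finite MDP via relative value iteration, which converges under standard aperiodicity assumptions; the function name bias_span, the array layout, and the toy numbers in the usage example are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bias_span(P, r, n_iter=10_000, tol=1e-8):
    """Estimate sp(h*) = max_s h*(s) - min_s h*(s) for an average-reward MDP
    via relative value iteration.

    P: transition tensor of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    r: reward matrix of shape (S, A)
    """
    S, A = r.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        # One Bellman backup for the average-reward optimality equation.
        q = r + np.einsum('sap,p->sa', P, h)
        h_new = q.max(axis=1)
        h_new -= h_new[0]  # normalize so the iterates stay bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return h.max() - h.min()

# Tiny two-state, two-action example (hypothetical numbers):
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # P[s, a, s']
r = np.array([[1.0, 0.0],
              [0.0, 0.5]])                 # r[s, a]
print(bias_span(P, r))  # an estimate of sp(h*), i.e., the H in Õ(HS√AT)
```

The point of the sketch is only to show that sp(h*) is a property of the MDP that can be much smaller than diameter-like quantities, which is what lets a span-based regret bound improve on diameter-based ones.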