We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes; in each episode it selects a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√AT) after T steps. We also relate the span to various diameter-like quantities associated with the MDP, showing how our results improve on previous regret bounds.
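To make the key quantity in the bound concrete: the span of the optimal bias vector h* is sp(h*) = max_s h*(s) − min_s h*(s), and H is an upper bound on it. Below is a minimal sketch (not the paper's algorithm) that estimates sp(h*) for a small, fully known finite MDP via relative value iteration, which converges under standard aperiodicity assumptions; the function name bias_span, the array layout, and the toy numbers in the usage example are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bias_span(P, r, n_iter=10_000, tol=1e-8):
    """Estimate sp(h*) = max_s h*(s) - min_s h*(s) for an average-reward MDP
    via relative value iteration.

    P: transition tensor of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    r: reward matrix of shape (S, A)
    """
    S, A = r.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        # One Bellman backup for the average-reward optimality equation.
        q = r + np.einsum('sap,p->sa', P, h)
        h_new = q.max(axis=1)
        h_new -= h_new[0]  # normalize so the iterates stay bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return h.max() - h.min()

# Tiny two-state, two-action example (hypothetical numbers):
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])   # P[s, a, s']
r = np.array([[1.0, 0.0],
              [0.0, 0.5]])                 # r[s, a]
print(bias_span(P, r))  # an estimate of sp(h*), i.e., the H in Õ(HS√AT)
```

The point of the sketch is only to show that sp(h*) is a property of the MDP that can be much smaller than diameter-like quantities, which is what lets a span-based regret bound improve on diameter-based ones.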