We consider an upper confidence bound algorithm for learning in Markov decision processes with deterministic transitions. For this algorithm we derive upper bounds on the online regret with respect to an (ε-)optimal policy that are logarithmic in the number of steps taken. We also present a corresponding lower bound. As an application, multi-armed bandits with switching cost are considered.
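
To make the idea of an upper confidence bound approach in a deterministic MDP more concrete, the following is a minimal sketch and not necessarily the algorithm analyzed here. It assumes rewards in [0, 1], a Hoeffding-style confidence bonus, and the average-reward criterion, under which an optimal policy in a finite deterministic MDP eventually follows a cycle of maximum mean reward. The function names `optimistic_rewards` and `max_mean_cycle_value`, the dictionary layout, and the toy MDP are all hypothetical and chosen for illustration only.

```python
import math


def optimistic_rewards(mean, count, t):
    """Empirical mean plus a confidence bonus, capped at 1.

    The bonus sqrt(log(t) / (2 * n)) is an assumed Hoeffding-style choice,
    not a formula taken from the paper.
    """
    return {
        edge: min(1.0, mean[edge] + math.sqrt(math.log(t) / (2 * count[edge])))
        for edge in mean
    }


def max_mean_cycle_value(next_state, reward):
    """Maximum mean reward over all cycles, via Karp's recurrence.

    next_state[(s, a)] is the deterministic successor of state s under action a;
    reward[(s, a)] is the (optimistic) reward of that transition.
    """
    states = sorted({s for s, _ in next_state} | set(next_state.values()))
    n = len(states)
    neg_inf = float("-inf")
    # best[k][v]: largest total reward of a walk with exactly k edges ending in v
    best = [{v: neg_inf for v in states} for _ in range(n + 1)]
    for v in states:
        best[0][v] = 0.0
    for k in range(1, n + 1):
        for (s, a), s2 in next_state.items():
            if best[k - 1][s] > neg_inf:
                best[k][s2] = max(best[k][s2], best[k - 1][s] + reward[(s, a)])
    value = neg_inf
    for v in states:
        if best[n][v] == neg_inf:
            continue
        value = max(
            value,
            min(
                (best[n][v] - best[k][v]) / (n - k)
                for k in range(n)
                if best[k][v] > neg_inf
            ),
        )
    return value


# Hypothetical two-state MDP with empirical mean rewards per (state, action).
next_state = {(0, "stay"): 0, (0, "go"): 1, (1, "stay"): 1, (1, "back"): 0}
mean = {(0, "stay"): 0.2, (0, "go"): 0.0, (1, "stay"): 0.5, (1, "back"): 0.9}
count = {edge: 1 for edge in mean}

# With t = 1 the bonus vanishes, so the optimistic rewards equal the means.
gain = max_mean_cycle_value(next_state, optimistic_rewards(mean, count, t=1))
```

In this toy example the candidate cycles are the two self-loops and the two-step loop between the states. An optimistic learner would follow the cycle that attains the computed value, update its counts and means, and recompute as the confidence bonuses shrink.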