We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an application, multi-armed bandits with switching cost are considered.
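As a rough illustration of the optimistic approach described above (a minimal sketch, not the paper's actual algorithm), the snippet below assumes the deterministic transition graph is known and only the mean rewards are unknown. Each transition receives a UCB-style optimistic reward estimate, and the optimistic average-reward policy corresponds to a maximum mean cycle in the resulting weighted graph, whose value is computed here with a Karp-style dynamic program. The names `ucb_index` and `max_mean_cycle_value`, the confidence constant `c`, and the toy data are all illustrative assumptions.

```python
import math

def ucb_index(mean_reward, pulls, t, c=2.0):
    """Optimistic estimate of a transition's mean reward: empirical mean
    plus a confidence bonus that shrinks as the transition is sampled.
    The constant c is an illustrative choice, not taken from the paper."""
    if pulls == 0:
        return float("inf")
    return mean_reward + math.sqrt(c * math.log(t) / pulls)

def max_mean_cycle_value(n, weights):
    """Value of the maximum-mean (average-weight) cycle in a deterministic
    transition graph on states 0..n-1, via a Karp-style dynamic program.

    weights: dict mapping (u, v) -> optimistic reward of transition u -> v.
    Returns -inf if the graph has no cycle.
    """
    NEG = float("-inf")
    # d[k][v] = maximum total weight of a walk with exactly k edges ending at v.
    # Walks may start anywhere (equivalent to adding a 0-weight super-source,
    # which neither creates nor destroys cycles).
    d = [[NEG] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for (u, v), w in weights.items():
            if d[k - 1][u] > NEG:
                d[k][v] = max(d[k][v], d[k - 1][u] + w)

    best = NEG
    for v in range(n):
        if d[n][v] == NEG:
            continue  # no walk of length n ends at v, so v lies on no cycle
        best = max(best, min((d[n][v] - d[k][v]) / (n - k) for k in range(n)))
    return best

# Toy usage: 3 states, optimistic rewards on the known deterministic transitions;
# each entry maps transition -> (empirical mean reward, number of visits).
if __name__ == "__main__":
    t = 100
    stats = {(0, 1): (0.4, 10), (1, 0): (0.7, 10), (1, 2): (0.5, 3), (2, 1): (0.2, 3)}
    optimistic = {e: ucb_index(m, visits, t) for e, (m, visits) in stats.items()}
    print(max_mean_cycle_value(3, optimistic))
```

For the switching-cost bandit application mentioned above, one would (roughly) model each arm as a state whose self-loop pays the arm's reward, while transitions between different states additionally pay the switching cost; this reduction is only sketched here and follows the general idea, not the paper's exact construction.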