An Upper Bound on the Loss from Approximate Optimal-Value Functions
Machine Learning
The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning
SIAM Journal on Control and Optimization
Finite-sample convergence rates for Q-learning and indirect algorithms
Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems 11
Buffer overflow management in QoS switches
STOC '01 Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing
Reinforcement Learning
Neuro-Dynamic Programming
The Nonstochastic Multiarmed Bandit Problem
SIAM Journal on Computing
Kernel-Based Reinforcement Learning
Machine Learning
Near-Optimal Reinforcement Learning in Polynomial Time
Machine Learning
Approximately Optimal Approximate Reinforcement Learning
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Gambling in a rigged casino: The adversarial multi-armed bandit problem
FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
The Journal of Machine Learning Research
Finite time bounds for sampling based fitted value iteration
ICML '05 Proceedings of the 22nd International Conference on Machine Learning
Proceedings of the 25th International Conference on Machine Learning
Rollout sampling approximate policy iteration
Machine Learning
Algorithms and Bounds for Rollout Sampling Approximate Policy Iteration
Recent Advances in Reinforcement Learning
Piecewise-stationary bandit problems with side observations
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
The offset tree for learning with partial labels
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Customized learning algorithms for episodic tasks with acyclic state spaces
CASE '09 Proceedings of the Fifth Annual IEEE International Conference on Automation Science and Engineering
Autonomous Agents and Multi-Agent Systems
Learning to trade off between exploration and exploitation in multiclass bandit prediction
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Multi-armed bandits with episode context
Annals of Mathematics and Artificial Intelligence
Nearly optimal exploration-exploitation decision thresholds
ICANN '06 Proceedings of the 16th International Conference on Artificial Neural Networks - Volume Part I
A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities
Proceedings of the 13th ACM Conference on Electronic Commerce
The K-armed dueling bandits problem
Journal of Computer and System Sciences
We incorporate statistical confidence intervals into both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that, given n arms, it suffices to pull the arms a total of O((n/ε²)log(1/δ)) times to find an ε-optimal arm with probability at least 1−δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures for reinforcement learning algorithms. We describe a framework that learns a confidence interval around the value function or the Q-function and eliminates actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method, and we further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
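The arm-elimination idea described in the abstract can be illustrated with a successive-elimination loop: maintain an empirical mean and a Hoeffding-style confidence radius for each surviving arm, and drop any arm whose upper confidence bound falls below the best lower confidence bound. The sketch below is a minimal illustration, not the paper's exact procedure; the function name, the radius constant, and the way δ is split across arms and rounds are assumptions.

```python
import math
import random


def action_elimination(arms, eps=0.1, delta=0.05, rounds=2000):
    """Illustrative confidence-interval action elimination for bandits.

    `arms` is a list of zero-argument callables returning rewards in [0, 1].
    Each round, every surviving arm is pulled once; arms whose upper
    confidence bound drops below the best lower confidence bound are
    eliminated. Stops when one arm remains or the interval width is <= eps.
    """
    active = list(range(len(arms)))
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for t in range(1, rounds + 1):
        for i in active:
            r = arms[i]()                       # pull arm i once
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]   # running mean update
        # Hoeffding-style radius; the 4*n*t^2/delta split is an assumption
        rad = math.sqrt(math.log(4 * len(arms) * t * t / delta) / (2 * t))
        best_lcb = max(means[i] - rad for i in active)
        active = [i for i in active if means[i] + rad >= best_lcb]
        if len(active) == 1 or 2 * rad <= eps:
            break
    return max(active, key=lambda i: means[i])


if __name__ == "__main__":
    random.seed(0)
    # Three Bernoulli arms with success probabilities 0.2, 0.5, 0.9
    arms = [lambda p=p: 1.0 if random.random() < p else 0.0
            for p in (0.2, 0.5, 0.9)]
    print(action_elimination(arms, eps=0.1, delta=0.05))
```

With clearly separated arms, the worst arms are eliminated after relatively few rounds, so most pulls concentrate on the near-optimal candidates; this is the source of the speedup the abstract reports over ε-greedy exploration, which keeps sampling provably bad actions.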