Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Authors:
Eyal Even-Dar;Shie Mannor;Yishay Mansour
Venue:
The Journal of Machine Learning Research
Year:
2006

Citing 13
Cited 14

An Upper Bound on the Loss from Approximate Optimal-Value Functions

Machine Learning
The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning

SIAM Journal on Control and Optimization
Finite-sample convergence rates for Q-learning and indirect algorithms

Proceedings of the 1998 conference on Advances in neural information processing systems II
Buffer overflow management in QoS switches

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Reinforcement Learning

Reinforcement Learning
Neuro-Dynamic Programming

Neuro-Dynamic Programming
The Nonstochastic Multiarmed Bandit Problem

SIAM Journal on Computing
Kernel-Based Reinforcement Learning

Machine Learning
Near-Optimal Reinforcement Learning in Polynomial Time

Machine Learning
Approximately Optimal Approximate Reinforcement Learning

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Gambling in a rigged casino: The adversarial multi-armed bandit problem

FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

The Journal of Machine Learning Research
Finite time bounds for sampling based fitted value iteration

ICML '05 Proceedings of the 22nd international conference on Machine learning

Exploration scavenging

Proceedings of the 25th international conference on Machine learning
Rollout sampling approximate policy iteration

Machine Learning
Algorithms and Bounds for Rollout Sampling Approximate Policy Iteration

Recent Advances in Reinforcement Learning
Piecewise-stationary bandit problems with side observations

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
The offset tree for learning with partial labels

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Customized learning algorithms for episodic tasks with acyclic state spaces

CASE'09 Proceedings of the fifth annual IEEE international conference on Automation science and engineering
Automated bidding in computational markets: an application in market-based allocation of computing services

Autonomous Agents and Multi-Agent Systems
Learning to trade off between exploration and exploitation in multiclass bandit prediction

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Multi-armed bandits with episode context

Annals of Mathematics and Artificial Intelligence
Nearly optimal exploration-exploitation decision thresholds

ICANN'06 Proceedings of the 16th international conference on Artificial Neural Networks - Volume Part I
A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities

Proceedings of the 13th ACM Conference on Electronic Commerce
The K-armed dueling bandits problem

Journal of Computer and System Sciences
Performance Guarantees for Empirical Markov Decision Processes with Applications to Multiperiod Inventory Models

Operations Research
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Machine Learning

Abstract

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that, given n arms, it suffices to pull the arms a total of O((n/ε^2) log(1/δ)) times to find an ε-optimal arm with probability at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures for reinforcement learning algorithms. We describe a framework based on learning a confidence interval around the value function or the Q-function and eliminating actions that are, with high probability, not optimal. We provide both model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
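To make the bandit claim concrete, the following is a minimal sketch of a round-based elimination procedure of the kind the abstract describes: sample every surviving arm, discard the empirically worse half, and tighten the accuracy and confidence parameters each round. The function name `median_elimination`, the `pull` callback, and the particular schedule of constants (ε/4, δ/2, the 3/4 shrink factor, and the per-round sample count) are assumptions made for illustration. They are not taken from the abstract and should be checked against the paper before being relied on.

```python
import math
import random


def median_elimination(pull, n_arms, epsilon, delta):
    """Illustrative elimination loop over arms with rewards in [0, 1].

    pull(arm) must return one reward sample for the given arm index.
    Returns (index of the last surviving arm, total number of pulls).
    The constants below are assumed for the sketch, not quoted from the text.
    """
    surviving = list(range(n_arms))
    eps_l = epsilon / 4.0    # per-round accuracy (assumed schedule)
    delta_l = delta / 2.0    # per-round failure probability (assumed schedule)
    total_pulls = 0

    while len(surviving) > 1:
        # Number of samples per arm for this round.
        samples = math.ceil(4.0 / eps_l ** 2 * math.log(3.0 / delta_l))

        # Empirical mean reward of every surviving arm.
        means = {}
        for arm in surviving:
            means[arm] = sum(pull(arm) for _ in range(samples)) / samples
        total_pulls += samples * len(surviving)

        # Keep the better half (rounded up), eliminating arms below the median.
        ranked = sorted(surviving, key=lambda a: means[a], reverse=True)
        surviving = ranked[: (len(ranked) + 1) // 2]

        # Tighten accuracy and confidence for the next round.
        eps_l *= 0.75
        delta_l /= 2.0

    return surviving[0], total_pulls


def make_bernoulli_pull(probabilities, seed=0):
    """Hypothetical helper: simulate Bernoulli arms with the given means."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probabilities[arm] else 0.0
```

For example, `median_elimination(make_bernoulli_pull([0.2, 0.5, 0.9]), 3, 0.2, 0.1)` runs two rounds on three simulated arms. Because the number of surviving arms halves each round while the per-arm sample count grows only geometrically, the total number of pulls stays proportional to (n/ε^2) log(1/δ) under this schedule, which is the shape of the bound stated above.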