PAC Bounds for Multi-armed Bandit and Markov Decision Processes

  • Authors:
  • Eyal Even-Dar; Shie Mannor; Yishay Mansour

  • Venue:
  • COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory
  • Year:
  • 2002

Abstract

The bandit problem is revisited and considered under the PAC model. Our main contribution here is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We then show how, given an algorithm for the PAC-model multi-armed bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
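
The published version of this work calls its main bandit algorithm Median Elimination. The Python sketch below illustrates the technique under simplifying assumptions, namely that each arm is a zero-argument callable returning a reward in [0, 1]; it is an illustrative reconstruction, not the authors' exact pseudocode.

```python
import math

def median_elimination(arms, epsilon, delta):
    """Return the index of an arm whose mean reward is within
    epsilon of the best arm's, with probability at least 1 - delta.

    Assumes each arm is a zero-argument callable returning a reward
    in [0, 1]. Total pulls: O((n / epsilon^2) * log(1 / delta)).
    """
    survivors = list(range(len(arms)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(survivors) > 1:
        # Hoeffding: this many pulls estimate each surviving mean
        # to within eps_l / 2, except with probability delta_l.
        pulls = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(arms[a]() for _ in range(pulls)) / pulls
                 for a in survivors}
        # Keep the empirically better half of the arms.
        survivors.sort(key=means.__getitem__, reverse=True)
        survivors = survivors[:math.ceil(len(survivors) / 2)]
        # Tighten accuracy and confidence geometrically so the total
        # error stays under epsilon and total failures under delta.
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]
```

The geometric schedules (eps_l shrinking by 3/4, delta_l halving) are what make the per-round sample sizes form a convergent series, so the total number of pulls is O((n/ε²) log(1/δ)) rather than paying a log(n/δ) union-bound factor for every arm, as the naive approach does.

The reduction to MDPs described in the abstract can be sketched the same way: simulate Value Iteration, and at each state treat the actions as bandit arms whose pulls sample the one-step backup. In the sketch below, the generative-model interface `sample_model(s, a)` and the fixed `eval_pulls` constant are hypothetical choices for illustration; the paper derives the required sample sizes from ε and δ.

```python
def phased_value_iteration(sample_model, states, actions, gamma,
                           num_phases, epsilon, delta, eval_pulls=200):
    """Batch MDP learning by simulating Value Iteration and invoking
    the PAC bandit routine at every state (the abstract's reduction).

    `sample_model(s, a)` is an assumed generative model returning a
    (reward, next_state) pair; `eval_pulls` is an illustrative
    constant, not the sample size the paper derives.
    """
    V = {s: 0.0 for s in states}
    for _ in range(num_phases):
        V_new = {}
        for s in states:
            # Each action at state s is one "arm"; a pull draws the
            # one-step backup r + gamma * V(s'). (Backups may leave
            # [0, 1]; a careful version rescales them first.)
            def make_arm(a, s=s):
                def pull():
                    r, s_next = sample_model(s, a)
                    return r + gamma * V[s_next]
                return pull
            arms = [make_arm(a) for a in actions]
            best = median_elimination(arms, epsilon, delta)
            # Average extra pulls of the chosen arm as the new value.
            V_new[s] = sum(arms[best]()
                           for _ in range(eval_pulls)) / eval_pulls
        V = V_new
    return V
```

Because the bandit subroutine only needs to identify a near-optimal action rather than estimate every action's value to full precision, the dependence on the number of actions improves, which is the point of the reduction.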