Finite-time Analysis of the Multiarmed Bandit Problem

Authors:
Peter Auer;Nicolò Cesa-Bianchi;Paul Fischer
Affiliations:
University of Technology Graz, A-8010 Graz, Austria. pauer@igi.tu-graz.ac.at;DTI, University of Milan, via Bramante 65, I-26013 Crema, Italy. cesa-bianchi@dti.unimi.it;Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany. fischer@ls2.informatik.uni-dortmund.de
Venue:
Machine Learning
Year:
2002

Citing 6
Cited 219

Asymptotically efficient adaptive control in stochastic regression models

Advances in Applied Mathematics
Nonparametric bandit methods

Annals of Operations Research
Adaptation in natural and artificial systems

Adaptation in natural and artificial systems
Multi-armed bandit problem revisited

Journal of Optimization Theory and Applications
Regular Article: Optimal Adaptive Policies for Sequential Allocation Problems

Advances in Applied Mathematics
Introduction to Reinforcement Learning

Introduction to Reinforcement Learning

The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

The Journal of Machine Learning Research
An adaptive algorithm for selecting profitable keywords for search-based advertising services

EC '06 Proceedings of the 7th ACM conference on Electronic commerce
On-line evolutionary computation for reinforcement learning in stochastic domains

Proceedings of the 8th annual conference on Genetic and evolutionary computation
The discoverability of the web

Proceedings of the 16th international conference on World Wide Web
Evolutionary Function Approximation for Reinforcement Learning

The Journal of Machine Learning Research
Combining online and offline knowledge in UCT

Proceedings of the 24th international conference on Machine learning
Multi-armed bandit problems with dependent arms

Proceedings of the 24th international conference on Machine learning
Empirical Studies in Action Selection with Reinforcement Learning

Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems
Dynamic cost-per-action mechanisms and applications to online advertising

Proceedings of the 17th international conference on World Wide Web
Multi-armed bandits in metric spaces

STOC '08 Proceedings of the fortieth annual ACM symposium on Theory of computing
Adaptive operator selection with dynamic multi-armed bandits

Proceedings of the 10th annual conference on Genetic and evolutionary computation
Exploration scavenging

Proceedings of the 25th international conference on Machine learning
Learning diverse rankings with multi-armed bandits

Proceedings of the 25th international conference on Machine learning
Sample-based learning and search with permanent and transient memories

Proceedings of the 25th international conference on Machine learning
Rollout sampling approximate policy iteration

Machine Learning
Tuning Bandit Algorithms in Stochastic Environments

ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
Extreme Value Based Adaptive Operator Selection

Proceedings of the 10th international conference on Parallel Problem Solving from Nature: PPSN X
Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

ALT '08 Proceedings of the 19th international conference on Algorithmic Learning Theory
Active Learning in Multi-armed Bandits

ALT '08 Proceedings of the 19th international conference on Algorithmic Learning Theory
Learning to play Go using recursive neural networks

Neural Networks
Algorithms and Bounds for Rollout Sampling Approximate Policy Iteration

Recent Advances in Reinforcement Learning
Optimistic Planning of Deterministic Systems

Recent Advances in Reinforcement Learning
Knowledge Generation for Improving Simulations in UCT for General Game Playing

AI '08 Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Approximation algorithms for restless bandit problems

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Improving the Exploration Strategy in Bandit Algorithms

Learning and Intelligent Optimization
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

Theoretical Computer Science
Adaptive bidding for display advertising

Proceedings of the 18th international conference on World wide web
A Nonparametric Asymptotic Analysis of Inventory Planning with Censored Demand

Mathematics of Operations Research
To create neuro-controlled game opponent from UCT-created data

Proceedings of the first ACM/SIGEVO Summit on Genetic and Evolutionary Computation
Bandit-based optimization on graphs with application to library performance tuning

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Piecewise-stationary bandit problems with side observations

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Characterizing truthful multi-armed bandit mechanisms: extended abstract

Proceedings of the 10th ACM conference on Electronic commerce
The price of truthfulness for pay-per-click auctions

Proceedings of the 10th ACM conference on Electronic commerce
Adaptive play in Texas Hold'em Poker

Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence
Analysis of adaptive operator selection techniques on the royal road and long k-path problems

Proceedings of the 11th Annual conference on Genetic and evolutionary computation
Extreme: dynamic multi-armed bandits for adaptive operator selection

Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers
An Efficient and Adaptive Mechanism for Parallel Simulation Replication

PADS '09 Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation
Efficient Multi-start Strategies for Local Search Algorithms

ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I
The max K-armed bandit: a new model of exploration applied to search heuristic selection

AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 3
Simulation-based approach to general game playing

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 1
Continuous time associative bandit problems

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
A machine learning approach for statistical software testing

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Optimal contraction theorem for exploration-exploitation tradeoff in search and optimization

IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
UCT for tactical assault planning in real-time strategy games

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Extreme compass and dynamic multi-armed bandits for adaptive operator selection

CEC'09 Proceedings of the Eleventh conference on Congress on Evolutionary Computation
Coevolving intelligent game players in a cultural framework

CEC'09 Proceedings of the Eleventh conference on Congress on Evolutionary Computation
Monte-Carlo Tree Search in Poker Using Expected Reward Distributions

ACML '09 Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning
Regret Minimization and Job Scheduling

SOFSEM '10 Proceedings of the 36th Conference on Current Trends in Theory and Practice of Computer Science
Monte Carlo search applied to card selection in magic: the gathering

CIG'09 Proceedings of the 5th international conference on Computational Intelligence and Games
A study on security evaluation methodology for image-based biometrics authentication systems

BTAS'09 Proceedings of the 3rd IEEE international conference on Biometrics: Theory, applications and systems
Improved rates for the stochastic continuum-armed bandit problem

COLT'07 Proceedings of the 20th annual conference on Learning theory
A contextual-bandit approach to personalized news article recommendation

Proceedings of the 19th international conference on World wide web
Bandit-based Monte-Carlo planning for the single-machine total weighted tardiness scheduling problem

EUROCAST'07 Proceedings of the 11th international conference on Computer aided systems theory
Structural statistical software testing with active learning in a graph

ILP'07 Proceedings of the 17th international conference on Inductive logic programming
Backpropagation modification in Monte-Carlo game tree search

IITA'09 Proceedings of the 3rd international conference on Intelligent information technology application
To create intelligent adaptive neuro-controller of game opponent from UCT-created data

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 2
Reward-modulated hebbian learning of decision making

Neural Computation
Truthful mechanisms with implicit payment computation

Proceedings of the 11th ACM conference on Electronic commerce
Online regret bounds for Markov decision processes with deterministic transitions

Theoretical Computer Science
Active learning in heteroscedastic noise

Theoretical Computer Science
Pure exploration in multi-armed bandits problems

ALT'09 Proceedings of the 20th international conference on Algorithmic learning theory
Toward comparison-based adaptive operator selection

Proceedings of the 12th annual conference on Genetic and evolutionary computation
Fitness-AUC bandit adaptive strategy selection vs. the probability matching one within differential evolution: an empirical comparison on the bbob-2010 noiseless testbed

Proceedings of the 12th annual conference companion on Genetic and evolutionary computation
Opportunistic spectrum access with multiple users: learning under competition

INFOCOM'10 Proceedings of the 29th conference on Information communications
Linearly Parameterized Bandits

Mathematics of Operations Research
A mean-based approach for real-time planning

Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Combining active learning and reactive control for robot grasping

Robotics and Autonomous Systems
Near-optimal Regret Bounds for Reinforcement Learning

The Journal of Machine Learning Research
Approximation algorithms for restless bandit problems

Journal of the ACM (JACM)
Sharp dichotomies for regret minimization in metric spaces

SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Tug-of-war model for multi-armed bandit problem

UC'10 Proceedings of the 9th international conference on Unconventional computation
Comparison-based adaptive strategy selection with bandits in differential evolution

PPSN'10 Proceedings of the 11th international conference on Parallel problem solving from nature: Part I
Exploration-exploitation of eye movement enriched multiple feature spaces for content-based image retrieval

ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part I
A minimum relative entropy principle for learning and acting

Journal of Artificial Intelligence Research
Distributed learning in multi-armed bandit with multiple players

IEEE Transactions on Signal Processing
Bandit-based estimation of distribution algorithms for noisy optimization: rigorous runtime analysis

LION'10 Proceedings of the 4th international conference on Learning and intelligent optimization
Consistency modifications for automatically tuned Monte-Carlo tree search

LION'10 Proceedings of the 4th international conference on Learning and intelligent optimization
Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint

Operations Research
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms

Proceedings of the fourth ACM international conference on Web search and data mining
Value of learning in sponsored search auctions

WINE'10 Proceedings of the 6th international conference on Internet and network economics
Solving non-stationary bandit problems by random sampling from sibling Kalman filters

IEA/AIE'10 Proceedings of the 23rd international conference on Industrial engineering and other applications of applied intelligent systems - Volume Part III
On the scalability of parallel UCT

CG'10 Proceedings of the 7th international conference on Computers and games
Biasing Monte-Carlo simulations through RAVE values

CG'10 Proceedings of the 7th international conference on Computers and games
Node-expansion operators for the UCT algorithm

CG'10 Proceedings of the 7th international conference on Computers and games
Regret Bounds and Minimax Policies under Partial Monitoring

The Journal of Machine Learning Research
Pure exploration in finitely-armed and continuous-armed bandits

Theoretical Computer Science
Computer poker: A review

Artificial Intelligence
Sampled fictitious play for approximate dynamic programming

Computers and Operations Research
Prior knowledge in learning finite parameter spaces

FG'09 Proceedings of the 14th international conference on Formal grammar
Not all parents are equal for MO-CMA-ES

EMO'11 Proceedings of the 6th international conference on Evolutionary multi-criterion optimization
Monte-Carlo tree search and rapid action value estimation in computer Go

Artificial Intelligence
Automating the runtime performance evaluation of simulation algorithms

Winter Simulation Conference
A Monte Carlo knowledge gradient method for learning abatement potential of emissions reduction technologies

Winter Simulation Conference
Approximating n-player behavioural strategy nash equilibria using coevolution

Proceedings of the 13th annual conference on Genetic and evolutionary computation
Policy learning in resource-constrained optimization

Proceedings of the 13th annual conference on Genetic and evolutionary computation
The road to VEGAS: guiding the search over neutral networks

Proceedings of the 13th annual conference on Genetic and evolutionary computation
Analyzing bandit-based adaptive operator selection mechanisms

Annals of Mathematics and Artificial Intelligence
A dynamic programming strategy to balance exploration and exploitation in the bandit problem

Annals of Mathematics and Artificial Intelligence
Off-line and on-line tuning: a study on operator selection for a memetic algorithm applied to the QAP

EvoCOP'11 Proceedings of the 11th European conference on Evolutionary computation in combinatorial optimization
Selecting Simulation Algorithm Portfolios by Genetic Algorithms

PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
Click shaping to optimize multiple objectives

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to trade off between exploration and exploitation in multiclass bandit prediction

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Better Algorithms for Benign Bandits

The Journal of Machine Learning Research
X-Armed Bandits

The Journal of Machine Learning Research
Multi-agent Monte Carlo Go

The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Learning the demand curve in posted-price digital goods auctions

The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Multigame playing by means of UCT enhanced with automatically generated evaluation functions

AGI'11 Proceedings of the 4th international conference on Artificial general intelligence
Parallel Monte-Carlo tree search for HPC systems

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
A selecting-the-best method for budgeted model selection

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I
ShareBoost: boosting for multi-view learning with performance guarantees

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Poster: selftuning batching in total order broadcast via analytical modelling and reinforcement learning

ACM SIGMETRICS Performance Evaluation Review - Special Issue on IFIP PERFORMANCE 2011- 29th International Symposium on Computer Performance, Modeling, Measurement and Evaluation
Personalized pricing recommender system: multi-stage epsilon-greedy approach

Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems
Monte-carlo style UCT search for boolean satisfiability

AI*IA'11 Proceedings of the 12th international conference on Artificial intelligence around man and beyond
Lipschitz bandits without the Lipschitz constant

ALT'11 Proceedings of the 22nd international conference on Algorithmic learning theory
Deviations of stochastic bandit regret

ALT'11 Proceedings of the 22nd international conference on Algorithmic learning theory
On upper-confidence bound policies for switching bandit problems

ALT'11 Proceedings of the 22nd international conference on Algorithmic learning theory
Asymptotically optimal agents

ALT'11 Proceedings of the 22nd international conference on Algorithmic learning theory
Efficient multi-start strategies for local search algorithms

Journal of Artificial Intelligence Research
Hierarchical Knowledge Gradient for Sequential Sampling

The Journal of Machine Learning Research
Dynamic channel, rate selection and scheduling for white spaces

Proceedings of the Seventh COnference on emerging Networking EXperiments and Technologies
Multi-armed bandits with episode context

Annals of Mathematics and Artificial Intelligence
A simple distribution-free approach to the max k-armed bandit problem

CP'06 Proceedings of the 12th international conference on Principles and Practice of Constraint Programming
Bandit based monte-carlo planning

ECML'06 Proceedings of the 17th European conference on Machine Learning
The grand challenge of computer Go: Monte Carlo tree search and extensions

Communications of the ACM
Reinforcement learning based sensing policy optimization for energy efficient cognitive radio networks

Neurocomputing
Multiple overlapping tiles for contextual monte carlo tree search

EvoApplications'10 Proceedings of the 2010 international conference on Applications of Evolutionary Computation - Volume Part I
Multi-armed bandit algorithms and empirical evaluation

ECML'05 Proceedings of the 16th European conference on Machine Learning
Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings

HRI '12 Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction
Creating an upper-confidence-tree program for havannah

ACG'09 Proceedings of the 12th international conference on Advances in Computer Games
Chasing a Moving Target: Exploitation and Exploration in Dynamic Environments

Management Science
Managing Delegated Search Over Design Spaces

Management Science
Bandit-Based genetic programming

EuroGP'10 Proceedings of the 13th European conference on Genetic Programming
Modelling empathy in social robotic companions

UMAP'11 Proceedings of the 19th international conference on Advances in User Modeling
Group recommendations via multi-armed bandits

Proceedings of the 21st international conference companion on World Wide Web
Dynamical information retrieval modelling: a portfolio-armed bandit machine approach

Proceedings of the 21st international conference companion on World Wide Web
PolyCert: polymorphic self-optimizing replication for in-memory transactional grids

Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
Computing approximate Nash Equilibria and robust best-responses using sampling

Journal of Artificial Intelligence Research
Dynamic pricing with limited supply

Proceedings of the 13th ACM Conference on Electronic Commerce
A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities

Proceedings of the 13th ACM Conference on Electronic Commerce
The K-armed dueling bandits problem

Journal of Computer and System Sciences
Evolutionary operator self-adaptation with diverse operators

EuroGP'12 Proceedings of the 15th European conference on Genetic Programming
Hyperparameter tuning in bandit-based adaptive operator selection

EvoApplications'12 Proceedings of the 2012 European conference on Applications of Evolutionary Computation
On combining decisions from multiple expert imitators for performance

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume One
Online planning for ad hoc autonomous agent teams

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume One
Real-time solving of quantified CSPs based on Monte-Carlo game tree search

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume One
Learning data transformation rules through examples: preliminary results

Proceedings of the Ninth International Workshop on Information Integration on the Web
The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

Operations Research
Automatic discovery of ranking formulas for playing with multi-armed bandits

EWRL'11 Proceedings of the 9th European conference on Recent Advances in Reinforcement Learning
Guiding combinatorial optimization with UCT

CPAIOR'12 Proceedings of the 9th international conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems
DCOPs and bandits: exploration and exploitation in decentralised coordination

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Just add Pepper: extending learning algorithms for repeated matrix games to repeated Markov games

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1
Optimistic Bayesian sampling in contextual-bandit problems

The Journal of Machine Learning Research
Playing repeated Stackelberg games with unknown opponents

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2
Personalized click shaping through lagrangian duality for online recommendation

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Dynamic randomization and domain knowledge in Monte-Carlo Tree Search for Go knowledge-based systems

Knowledge-Based Systems
Dynamic Pricing Under a General Parametric Choice Model

Operations Research
A diversity dilemma in evolutionary markets

Proceedings of the 13th International Conference on Electronic Commerce
LogUCB: an explore-exploit algorithm for comments recommendation

Proceedings of the 21st ACM international conference on Information and knowledge management
Sequential selection of correlated ads by POMDPs

Proceedings of the 21st ACM international conference on Information and knowledge management
Multi-armed bandit formulation of the task partitioning problem in swarm robotics

ANTS'12 Proceedings of the 8th international conference on Swarm Intelligence
Adaptive operator selection at the hyper-level

PPSN'12 Proceedings of the 12th international conference on Parallel Problem Solving from Nature - Volume Part II
Bootstrapping monte carlo tree search with an imperfect heuristic

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II
Thompson sampling: an asymptotically optimal finite-time analysis

ALT'12 Proceedings of the 23rd international conference on Algorithmic Learning Theory
Regret bounds for restless markov bandits

ALT'12 Proceedings of the 23rd international conference on Algorithmic Learning Theory
Confidence bounds for statistical model checking of probabilistic hybrid systems

FORMATS'12 Proceedings of the 10th international conference on Formal Modeling and Analysis of Timed Systems
PolyCert: polymorphic self-optimizing replication for in-memory transactional grids

Proceedings of the 12th International Middleware Conference
Lessons Learned from 15 Years of Operations Research for French TV Channel TF1

Interfaces
Learning and incentives in user-generated content: multi-armed bandits with endogenous arms

Proceedings of the 4th conference on Innovations in Theoretical Computer Science
New algorithms for budgeted learning

Machine Learning
A contextual-bandit algorithm for mobile context-aware recommender system

ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part III
Pilot, rollout and monte carlo tree search methods for job shop scheduling

LION'12 Proceedings of the 6th international conference on Learning and Intelligent Optimization
Upper confidence tree-based consistent reactive planning application to minesweeper

LION'12 Proceedings of the 6th international conference on Learning and Intelligent Optimization
Bandit-based structure learning for bayesian network classifiers

ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part II
Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations

IEEE/ACM Transactions on Networking (TON)
Reusing historical interaction data for faster online learning to rank for IR

Proceedings of the sixth ACM international conference on Web search and data mining
Exploration / exploitation trade-off in mobile context-aware recommender systems

AI'12 Proceedings of the 25th Australasian joint conference on Advances in Artificial Intelligence
A hybrid differential evolution algorithm for job shop scheduling problems with expected total tardiness criterion

Applied Soft Computing
TEXPLORE: real-time sample-efficient reinforcement learning for robots

Machine Learning
Content recommendation on web portals

Communications of the ACM
Sustainable cooperative coevolution with a multi-armed bandit

Proceedings of the 15th annual conference on Genetic and evolutionary computation
Dynamic Pay-Per-Action Mechanisms and Applications to Online Advertising

Operations Research
Micro adaptivity in Vectorwise

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Multi-parameter mechanisms with implicit payment computation

Proceedings of the fourteenth ACM conference on Electronic commerce
Distributed Gibbs: a memory-bounded sampling-based DCOP algorithm

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Online implicit agent modelling

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Learning exploration strategies in model-based reinforcement learning

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Learning in real-time in repeated games using experts

Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
A generic adaptive simulation algorithm for component-based simulation systems

Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
Active search on graphs

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Truthful incentives in crowdsourcing tasks using regret minimization mechanisms

Proceedings of the 22nd international conference on World Wide Web
Mixing bandits: a recipe for improved cold-start recommendations in a social network

Proceedings of the 7th Workshop on Social Network Mining and Analysis
Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem

The Journal of Machine Learning Research
Ranked bandits in metric spaces: learning diverse rankings over large document collections

The Journal of Machine Learning Research
Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality

The Journal of Machine Learning Research
Interactive collaborative filtering

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Evaluating simulation software components with player rating systems

Proceedings of the 6th International ICST Conference on Simulation Tools and Techniques
Design and parametric considerations for artificial neural network pruning in UCT game playing

Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
Automatic ad format selection via contextual bandits

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Scheduling black-box mutational fuzzing

Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security
Online Learning for Personalized Room-Level Thermal Control: A Multi-Armed Bandit Framework

Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings
OM Forum---Operations Management Challenges for Some “Cleantech” Firms

Manufacturing & Service Operations Management
Monte-Carlo expectation maximization for decentralized POMDPs

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Sufficiency-based selection strategy for MCTS

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Monte Carlo *-minimax search

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Lazy paired hyper-parameter tuning

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Lifelong learning for acquiring the wisdom of the crowd

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Using reinforcement learning to find an optimal set of features

Computers & Mathematics with Applications
LASER: a scalable response prediction platform for online advertising

Proceedings of the 7th ACM international conference on Web search and data mining
Relative confidence sampling for efficient on-line ranker evaluation

Proceedings of the 7th ACM international conference on Web search and data mining
Monte-Carlo tree search for Bayesian reinforcement learning

Applied Intelligence
Online learning for auction mechanism in bandit setting

Decision Support Systems
Robustness of stochastic bandit policies

Theoretical Computer Science
Counterfactual reasoning and learning systems: the example of computational advertising

The Journal of Machine Learning Research
Machine learning in an auction environment

Proceedings of the 23rd international conference on World wide web
Efficient bidding strategies for Cliff-Edge problems

Autonomous Agents and Multi-Agent Systems
BoostingTree: parallel selection of weak learners in boosting, with application to ranking

Machine Learning
Scalable and efficient bayes-adaptive reinforcement learning based on monte-carlo tree search

Journal of Artificial Intelligence Research
Adaptive crawler for external hyperlinks search and acquisition

Automation and Remote Control
A tour of machine learning: An AI perspective

AI Communications - ECAI 2012 Turing and Anniversary Track

Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
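
The abstract does not spell out a policy, so the following is only an illustrative sketch of the kind of simple index policy it alludes to: an upper-confidence-bound rule, commonly called UCB1 and associated with this paper. It plays each arm once, then always plays the arm with the largest value of empirical mean plus sqrt(2 ln n / n_j), where n is the number of plays so far and n_j the number of plays of arm j. The Bernoulli reward model, the function name ucb1, and the parameters means, horizon, and seed are assumptions made for the example and do not come from the record above. The returned regret is the pseudo-regret: the gap to the best mean, summed over the plays actually made.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Simulate a UCB1-style index policy on Bernoulli arms (illustrative).

    means   -- hypothetical success probability of each arm, in [0, 1]
    horizon -- total number of plays
    Returns (counts, regret): plays per arm and the resulting pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of times each arm has been played
    sums = [0.0] * k   # total reward collected from each arm

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: play each arm once.
            arm = t - 1
        else:
            plays_so_far = t - 1
            # Index = empirical mean + confidence radius sqrt(2 ln n / n_j).
            arm = max(
                range(k),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(plays_so_far) / counts[j]),
            )
        # Bernoulli reward (an assumption; any reward in [0, 1] would do).
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    best = max(means)
    # Pseudo-regret: gap to the best mean times the number of plays, per arm.
    regret = sum((best - m) * n for m, n in zip(means, counts))
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.8], horizon=2000)
    print("plays per arm:", counts)
    print("pseudo-regret:", round(regret, 2))
```

Running the sketch for increasing horizons and plotting the pseudo-regret against the logarithm of the horizon is one informal way to see the logarithmic growth described in the abstract.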