The multiarmed bandit is often used as an analogy for the tradeoff between exploration and exploitation in search problems. The classic problem involves allocating trials among the arms of a multiarmed slot machine to maximize the expected sum of rewards. We pose a new variation of the multiarmed bandit, the Max K-Armed Bandit, in which trials must be allocated among the arms to maximize the expected best single sample reward over the series of trials. The Max K-Armed Bandit is motivated by the problem of allocating restarts among a set of multistart stochastic search algorithms. We present an analysis of this Max K-Armed Bandit showing, under certain assumptions, that the optimal strategy allocates trials to the observed best arm at a rate that increases double exponentially relative to the other arms. This motivates an exploration strategy that follows a Boltzmann distribution with an exponentially decaying temperature parameter. We compare this exploration policy to policies that allocate trials to the observed best arm at rates faster (and slower) than double exponential. The results confirm, for two scheduling domains, that the double exponential increase in the rate of allocations to the observed best heuristic outperforms the other approaches.
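The exploration policy described above can be sketched in a few lines: arms are sampled with Boltzmann (softmax) probabilities over each arm's best observed reward, and the temperature decays exponentially so that allocations concentrate on the observed best arm at an accelerating rate. This is only an illustrative sketch, not the paper's implementation; the function name, the temperature schedule parameters `t0` and `decay`, and the use of zero-argument callables as arms are all assumptions made for the example.

```python
import math
import random

def max_k_armed_boltzmann(arms, n_trials, t0=1.0, decay=0.95):
    """Boltzmann exploration sketch for the max k-armed bandit.

    `arms` is a list of zero-argument callables, each returning one
    stochastic sample (e.g. the solution quality of one restart of a
    multistart heuristic).  Returns the best single sample observed
    over all trials, the quantity the max k-armed bandit maximizes.
    """
    # Prime each arm once so every arm has at least one observation.
    best = [arm() for arm in arms]       # best sample seen per arm
    temp = t0
    for _ in range(n_trials - len(arms)):
        # Softmax over best-so-far values; subtract the max for
        # numerical stability before exponentiating.
        m = max(best)
        weights = [math.exp((b - m) / temp) for b in best]
        total = sum(weights)
        # Sample an arm index in proportion to its weight.
        r, acc, choice = random.random() * total, 0.0, 0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                choice = i
                break
        best[choice] = max(best[choice], arms[choice]())
        temp *= decay                    # exponentially decaying temperature
    return max(best)                     # best single sample over all trials
```

As the temperature shrinks, the softmax sharpens toward the arm with the highest observed maximum, so the fraction of trials given to that arm grows rapidly, which is the qualitative behavior the analysis above calls for.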