We formulate the following combinatorial multi-armed bandit (MAB) problem: there are N random variables with unknown means, each instantiated in an i.i.d. fashion over time. At each time step, multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected random variables are observed at that time, and the reward is a linearly weighted combination of them. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and the reward obtained by the given policy. This formulation is broadly applicable to stochastic online versions of many network tasks that can be cast as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping every feasible combination to its own arm allows prior work on single-play MAB to be used, but results in regret, storage, and computation that all grow exponentially in the number of unknown variables. We present new efficient policies for this problem that are shown to achieve regret growing logarithmically with time and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. When the underlying deterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used.
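The abstract does not spell out the policies themselves, so the following is only a minimal sketch of the setup it describes: per-variable UCB-style indices maintained from the individually observed samples, with each round's action chosen by maximizing the summed indices over the feasible combinations. The function name `combinatorial_ucb`, the explicit `actions` list, and the confidence term are my assumptions for illustration, not the paper's algorithm.

```python
import math

def combinatorial_ucb(actions, sample_fn, horizon):
    """Sketch of a combinatorial MAB policy (not the paper's exact algorithm).

    actions: list of index tuples, each a feasible combination of variables.
    sample_fn(i): draws one i.i.d. sample of random variable i.
    Returns the total reward collected over `horizon` rounds.
    """
    n = 1 + max(i for a in actions for i in a)  # number of unknown variables
    counts = [0] * n      # per-variable observation counts
    means = [0.0] * n     # per-variable sample means
    total = 0.0
    for t in range(1, horizon + 1):
        # UCB index per variable; never-observed variables get +inf
        # so every variable is eventually explored.
        ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] > 0 else float("inf")
               for i in range(n)]
        # Select the feasible combination maximizing the sum of indices
        # (the linear objective the abstract refers to).
        best = max(actions, key=lambda a: sum(ucb[i] for i in a))
        # Observe each selected variable individually and update its stats.
        for i in best:
            x = sample_fn(i)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
            total += x
    return total
```

Note that storage here is linear in the number of variables (one count and one mean each), matching the storage claim in the abstract; for structured action sets such as paths or matchings, the `max` over `actions` would be replaced by the corresponding polynomial-time combinatorial solver run on the UCB indices.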