Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations

  • Authors:
  • Yi Gai; Bhaskar Krishnamachari; Rahul Jain

  • Affiliations:
  • Department of Electrical Engineering, University of Southern California, Los Angeles, CA (all authors)

  • Venue:
  • IEEE/ACM Transactions on Networking (TON)
  • Year:
  • 2012

Abstract

We formulate the following combinatorial multi-armed bandit (MAB) problem: there are N random variables with unknown means, each instantiated in an i.i.d. fashion over time. At each time, multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and the reward is a linearly weighted combination of the selected variables. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and the reward obtained by the given policy. This formulation is broadly applicable to stochastic online versions of many interesting network tasks that can be cast as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping every feasible combination to its own arm allows prior work on single-play MABs to be used, but results in regret, storage, and computation that grow exponentially in the number of unknown variables. We present new efficient policies for this problem that achieve regret growing logarithmically with time and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. For problems where the underlying deterministic problem is tractable, they further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used.
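
For concreteness, the regret definition stated in words above can be written out as follows. The notation here (a feasible set F, weights w_i, binary selections a_i(t), and means theta_i) is introduced purely for illustration and is not taken verbatim from the abstract:

```latex
% Regret of policy \pi after n rounds. a^* is the action a genie with
% knowledge of the means would pick; a(t) is the policy's action at
% time t; X_i(t) is the i.i.d. draw of variable i with mean \theta_i.
R^{\pi}(n) \;=\; n \max_{a \in \mathcal{F}} \sum_{i=1}^{N} w_i\, a_i\, \theta_i
\;-\; \mathbb{E}\!\left[\, \sum_{t=1}^{n} \sum_{i=1}^{N} w_i\, a_i(t)\, X_i(t) \right]
```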
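The abstract does not spell out the policies themselves. As an illustration of how a policy with per-variable storage can be organized, the sketch below maintains a sample mean and an observation count for each variable and, at each step, hands optimistic (UCB-style) estimates to the deterministic combinatorial solver. Everything here is an assumption for the sketch rather than the paper's exact construction: the (L + 1) confidence scaling is one conventional choice, and the brute-force argmax over actions stands in for a polynomial-time solver (matching, shortest path, etc.).

```python
import math
import random

def ucb_combinatorial_policy(num_vars, feasible_actions, weights,
                             sample, horizon, L):
    """UCB-style sketch for combinatorial bandits with linear rewards
    and individual observations of every selected variable.

    feasible_actions: list of actions, each a set of variable indices
                      (assumes every variable appears in some action)
    weights:          per-variable weights w_i
    sample:           sample(i) -> one i.i.d. draw of variable i
    L:                max number of variables any single action selects
    """
    means = [0.0] * num_vars   # running sample mean of each variable
    counts = [0] * num_vars    # times each variable has been observed

    def play(action):
        # Observe every selected variable individually; update stats.
        for i in action:
            x = sample(i)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]

    # Initialization: ensure every variable is observed at least once.
    for i in range(num_vars):
        if counts[i] == 0:
            play(next(a for a in feasible_actions if i in a))

    for t in range(1, horizon + 1):
        # Optimistic index per variable (confidence scaling assumed).
        index = [means[i] + math.sqrt((L + 1) * math.log(t) / counts[i])
                 for i in range(num_vars)]
        # Deterministic combinatorial problem solved with optimistic
        # estimates; brute force here stands in for an efficient solver.
        action = max(feasible_actions,
                     key=lambda a: sum(weights[i] * index[i] for i in a))
        play(action)

    return means, counts

# Toy instance: 4 Bernoulli variables, actions select pairs of them.
rng = random.Random(0)
mus = [0.2, 0.5, 0.8, 0.4]
means, counts = ucb_combinatorial_policy(
    num_vars=4,
    feasible_actions=[{0, 1}, {1, 2}, {2, 3}, {0, 3}],
    weights=[1.0] * 4,
    sample=lambda i: 1.0 if rng.random() < mus[i] else 0.0,
    horizon=5000, L=2)
```

Because only the N per-variable means and counts are stored, storage grows linearly in the number of unknown variables even though the number of feasible actions may be exponential, matching the storage claim in the abstract.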