Multi-armed bandit problems with dependent arms

  • Authors: Sandeep Pandey, Deepayan Chakrabarti, Deepak Agarwal
  • Affiliation: Yahoo! Research, Sunnyvale, CA
  • Venue: Proceedings of the 24th International Conference on Machine Learning (ICML)
  • Year: 2007

Abstract

We provide a framework for exploiting dependencies among arms in multi-armed bandit problems when the dependencies take the form of a generative model on clusters of arms. We derive an optimal MDP-based policy for the discounted reward case, along with an approximation to it that carries a formal error guarantee. We discuss lower bounds on regret in the undiscounted reward setting and propose a general two-level bandit policy for it. We present three instantiations of this general policy and give theoretical justifications for how the regret of each instantiation depends on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, showing that they significantly outperform classical policies designed for bandits with independent arms.
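To make the two-level idea concrete, below is a minimal sketch of one possible instantiation: UCB1-style index selection first over clusters (scoring each cluster by the pooled statistics of its arms), then over the arms within the chosen cluster. This is an illustrative reading of the abstract, not the paper's exact algorithm; the pooled-mean cluster statistic, the class names, and the `pull` interface are all assumptions made for this sketch, and the paper's actual instantiations differ in how cluster statistics are formed.

```python
import math
import random

# Hypothetical sketch of a two-level bandit policy over clustered arms.
# Level 1 picks a cluster by a UCB1 index on pooled cluster statistics;
# level 2 picks an arm within that cluster by UCB1 on per-arm statistics.
# The pooled-mean statistic is just one choice; the paper studies several.

class Arm:
    def __init__(self):
        self.pulls = 0
        self.reward_sum = 0.0

    def mean(self):
        return self.reward_sum / self.pulls if self.pulls else 0.0

def ucb(mean, n, total):
    # Standard UCB1 index; options never tried get infinite priority.
    if n == 0:
        return float("inf")
    return mean + math.sqrt(2.0 * math.log(total) / n)

def two_level_policy(clusters, pull, horizon):
    """clusters: list of lists of Arm; pull(c, a) -> reward in [0, 1]."""
    for t in range(1, horizon + 1):
        # Level 1: score each cluster by the pooled mean and pull count
        # of its arms (a simple stand-in for a cluster-level estimate).
        def cluster_stats(cluster):
            n = sum(a.pulls for a in cluster)
            s = sum(a.reward_sum for a in cluster)
            return (s / n if n else 0.0), n

        ci = max(range(len(clusters)),
                 key=lambda i: ucb(*cluster_stats(clusters[i]), t))
        cluster = clusters[ci]

        # Level 2: UCB1 among the arms of the selected cluster only.
        n_cluster = sum(a.pulls for a in cluster)
        ai = max(range(len(cluster)),
                 key=lambda i: ucb(cluster[i].mean(), cluster[i].pulls,
                                   max(n_cluster, 1)))
        arm = cluster[ai]

        # Pull the chosen arm and update its statistics.
        r = pull(ci, ai)
        arm.pulls += 1
        arm.reward_sum += r

# Example: 3 clusters of 5 arms each with random rewards.
clusters = [[Arm() for _ in range(5)] for _ in range(3)]
two_level_policy(clusters, lambda c, a: random.random(), horizon=1000)
```

The appeal of the two-level structure is that pooled cluster statistics accumulate faster than per-arm statistics, so a policy can quickly discard entire clusters of poor arms; how well this works depends on how tightly the generative model ties together the arms within a cluster, which is exactly the dependence on cluster characteristics the abstract refers to.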