Max-cover in map-reduce

Authors:
Flavio Chierichetti;Ravi Kumar;Andrew Tomkins
Affiliations:
Univ. of Rome, Rome, Italy;Yahoo! Inc., Sunnyvale, USA;Google, Inc., Mountain View, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010



Abstract

The NP-hard Max-k-cover problem requires selecting k sets from a collection so as to maximize the size of the union. This classic problem occurs commonly in many settings in web search and advertising. For moderately-sized instances, a greedy algorithm gives an approximation of (1-1/e). However, the greedy algorithm requires updating scores of arbitrary elements after each step, and hence becomes intractable for large datasets. We give the first max cover algorithm designed for today's large-scale commodity clusters. Our algorithm has provably almost the same approximation as greedy, but runs much faster. Furthermore, it can be easily expressed in the MapReduce programming paradigm, and requires only polylogarithmically many passes over the data. Our experiments on five large problem instances show that our algorithm is practical and can achieve good speedups compared to the sequential greedy algorithm.
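For concreteness, the sequential greedy baseline that the abstract compares against can be sketched as follows. This is only an illustrative sketch of the textbook greedy procedure, not the authors' MapReduce algorithm; the function name `greedy_max_cover` and the dictionary-of-sets input layout are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_max_cover(
    sets: Dict[Hashable, Set[int]], k: int
) -> Tuple[List[Hashable], Set[int]]:
    """Pick up to k sets, each time taking the one covering the most new elements.

    Hypothetical helper for illustration; this is the classic sequential
    greedy heuristic, which achieves a (1 - 1/e) approximation.
    """
    # Work on copies so the caller's sets are not modified.
    remaining = {name: set(elems) for name, elems in sets.items()}
    chosen: List[Hashable] = []
    covered: Set[int] = set()

    for _ in range(min(k, len(remaining))):
        # Largest marginal gain wins; keys are sorted so ties break deterministically.
        best = max(sorted(remaining, key=str), key=lambda name: len(remaining[name]))
        if not remaining[best]:
            break  # no remaining set adds any uncovered element
        newly_covered = remaining.pop(best)
        chosen.append(best)
        covered |= newly_covered
        # The costly step the abstract points to: after every pick, the
        # scores of all other sets change because they lose these elements.
        for elems in remaining.values():
            elems -= newly_covered

    return chosen, covered
```

For example, with the sets `A = {1, 2, 3, 4}`, `B = {3, 4, 5}`, `C = {5, 6}`, `D = {1, 6}` and `k = 2`, the sketch first picks `A` and then `C`, covering all six elements. The inner update loop touches every remaining set after each pick, which is the per-step cost that makes the sequential approach hard to scale to very large instances.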