Set cover algorithms for very large datasets

Authors:
Graham Cormode;Howard Karloff;Anthony Wirth
Affiliations:
AT&T Labs - Research, Florham Park, NJ, USA;AT&T Labs - Research, Florham Park, NJ, USA;The University of Melbourne, Parkville, Australia
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 8

Efficient NC algorithms for set cover with applications to learning and geometry
Proceedings of the 30th IEEE symposium on Foundations of computer science

A threshold of ln n for approximating set cover
Journal of the ACM (JACM)

Using association rules for product assortment decisions: a case study
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining

Set Cover with Requirements and Costs Evolving over Time
RANDOM-APPROX '99 Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization Problems: Randomization, Approximation, and Combinatorial Algorithms and Techniques

Experimental analysis of approximation algorithms for the vertex cover and set covering problems
Computers and Operations Research

On generating near-optimal tableaux for conditional functional dependencies
Proceedings of the VLDB Endowment

Approximation algorithms for combinatorial problems
Journal of Computer and System Sciences

Max-cover in map-reduce
Proceedings of the 19th international conference on World wide web

Cited 3

SCARAB: scaling reachability computation on large graphs
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Parallel and I/O efficient set covering algorithms
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures

Fast greedy algorithms in mapreduce and streaming
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures

Abstract

The Set Cover problem, which asks for the smallest subcollection of sets that covers some universe, is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used and typically finds solutions that are close to optimal. However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident: the greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. To scale Set Cover to large datasets, we provide a new algorithm that finds a solution provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately sized datasets, and an even greater improvement on larger datasets.
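
For readers unfamiliar with the baseline, the greedy rule described in the abstract (repeatedly pick the set with the largest number of uncovered items) can be sketched in a few lines of Python. This is only an illustration of the textbook in-memory greedy heuristic, not the disk-friendly algorithm the paper proposes; the function name `greedy_set_cover` and the toy input are assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable, List, Set


def greedy_set_cover(universe: Iterable[Hashable],
                     sets: Dict[str, Set[Hashable]]) -> List[str]:
    """Return the names of the chosen sets, in the order they were picked.

    Illustrative in-memory sketch of the classic greedy heuristic.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Greedy step: the set covering the most still-uncovered items.
        best = max(sets, key=lambda name: len(sets[name] & uncovered),
                   default=None)
        if best is None or not (sets[best] & uncovered):
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical toy instance.
example_sets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
cover = greedy_set_cover(range(1, 6), example_sets)
print(cover)
```

On this toy instance the first pick is `A` (three new items) and the second is `D` (two new items), after which nothing is left uncovered. The sketch holds every set in memory and recomputes each set's uncovered count in every round, which is exactly the assumption that stops being reasonable in the very large, disk-resident setting the abstract targets.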