Rounding via trees: deterministic approximation algorithms for group Steiner trees and k-median
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
On approximating arbitrary metrics by tree metrics
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
A new greedy approach for facility location problems
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A constant-factor approximation algorithm for the k-median problem
Journal of Computer and System Sciences - STOC 1999
Clustering Data Streams: Theory and Practice
IEEE Transactions on Knowledge and Data Engineering
Improved Combinatorial Algorithms for the Facility Location and k-Median Problems
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Probabilistic approximation of metric spaces and its algorithmic applications
FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science
Local Search Heuristics for k-Median and Facility Location Problems
SIAM Journal on Computing
Incremental Clustering and Dynamic Information Retrieval
SIAM Journal on Computing
Quick k-Median, k-Center, and Facility Location for Sparse Graphs
SIAM Journal on Computing
PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On k-Median clustering in high dimensions
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
MapReduce: simplified data processing on large clusters
OSDI '04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
A constant factor approximation algorithm for k-median clustering with outliers
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity
APPROX '08 / RANDOM '08 Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop, RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques
Hadoop: The Definitive Guide
Proceedings of the 19th international conference on World wide web
Pregel: a system for large-scale graph processing
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Parallel approximation algorithms for facility-location problems
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
On distributing symmetric streaming computations
ACM Transactions on Algorithms (TALG)
Data-Intensive Text Processing with MapReduce
A model of computation for MapReduce
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Proceedings of the VLDB Endowment
An architecture for component-based design of representative-based clustering algorithms
Data & Knowledge Engineering
Computers in Biology and Medicine
Parallel rough set based knowledge acquisition using MapReduce from big data
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast greedy algorithms in mapreduce and streaming
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Cloud MapReduce for particle filter-based data assimilation for wildfire spread simulation
Proceedings of the High Performance Computing Symposium
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
A fast algorithm for clustering with mapreduce
ISNN'13 Proceedings of the 10th international conference on Advances in Neural Networks - Volume Part I
International Journal of Approximate Reasoning
Clustering problems have numerous applications and become more challenging as data sizes grow. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on two practical and popular clustering problems, k-center and k-median, and develop fast clustering algorithms for them with constant-factor approximation guarantees. From a theoretical perspective, we give the first analysis showing that several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to reduce the data size and then run a time-consuming clustering algorithm, such as local search or Lloyd's algorithm, on the resulting data set. Because they run in a constant number of MapReduce rounds, our algorithms are flexible enough to be used in practice. We complement these results with experiments that compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that the solutions produced by our algorithms are similar to or better than those of the other algorithms. Furthermore, on sufficiently large data sets, our algorithms are faster than the other parallel algorithms that we tested.
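The sample-then-cluster idea in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's algorithm. It assumes plain uniform sampling (the abstract does not specify the sampling scheme) and swaps in the classic greedy farthest-first heuristic for k-center as the clustering step instead of local search or Lloyd's algorithm. All function names and parameters (`farthest_first`, `sample_then_cluster`, `sample_size`, `seed`) are hypothetical, and nothing here is distributed across MapReduce rounds.

```python
import math
import random


def farthest_first(points, k):
    """Greedy farthest-first traversal, a standard 2-approximation for k-center."""
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Pick the point that is currently farthest from every chosen center.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def sample_then_cluster(points, k, sample_size, seed=0):
    """Assumed scheme: draw a uniform sample, then cluster only the sample."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return farthest_first(sample, k)


def k_center_cost(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def k_median_cost(points, centers):
    """Sum of distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)
```

The centers are computed from the sample only, while the cost functions evaluate them against the full data set. That mirrors the point of the approach: the expensive clustering step runs on far fewer points than the input contains.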