Fast clustering using MapReduce

Authors:
Alina Ene;Sungjin Im;Benjamin Moseley
Affiliations:
University of Illinois, Urbana, IL, USA;University of Illinois, Urbana, IL, USA;University of Illinois, Urbana, IL, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Citing 23
Cited 10

Rounding via trees: deterministic approximation algorithms for group Steiner trees and k-median

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
On approximating arbitrary metrices by tree metrics

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
A new greedy approach for facility location problems

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A constant-factor approximation algorithm for the k-median problem

Journal of Computer and System Sciences - STOC 1999
Clustering Data Streams: Theory and Practice

IEEE Transactions on Knowledge and Data Engineering
Improved Combinatorial Algorithms for the Facility Location and k-Median Problems

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Probabilistic approximation of metric spaces and its algorithmic applications

FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science
Local Search Heuristics for k-Median and Facility Location Problems

SIAM Journal on Computing
Incremental Clustering and Dynamic Information Retrieval

SIAM Journal on Computing
Quick k-Median, k-Center, and Facility Location for Sparse Graphs

SIAM Journal on Computing
k-means projective clustering

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On k-Median clustering in high dimensions

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
k-means++: the advantages of careful seeding

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
A constant factor approximation algorithm for k-median clustering with outliers

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity

APPROX '08 / RANDOM '08 Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop, RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques
Hadoop: The Definitive Guide

Hadoop: The Definitive Guide
Max-cover in map-reduce

Proceedings of the 19th international conference on World wide web
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Parallel approximation algorithms for facility-location problems

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
On distributing symmetric streaming computations

ACM Transactions on Algorithms (TALG)
Data-Intensive Text Processing with MapReduce

Data-Intensive Text Processing with MapReduce
A model of computation for MapReduce

SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms

Scalable k-means++

Proceedings of the VLDB Endowment
An architecture for component-based design of representative-based clustering algorithms

Data & Knowledge Engineering
A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach

Computers in Biology and Medicine
Parallel rough set based knowledge acquisition using MapReduce from big data

Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Minimal MapReduce algorithms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast greedy algorithms in mapreduce and streaming

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Cloud MapReduce for particle filter-based data assimilation for wildfire spread simulation

Proceedings of the High Performance Computing Symposium
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
A fast algorithm for clustering with mapreduce

ISNN'13 Proceedings of the 10th international conference on Advances in Neural Networks - Volume Part I
A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems

International Journal of Approximate Reasoning

Abstract

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large data sets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant-factor approximation guarantees. From a theoretical perspective, we give the first analysis showing that several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and then run a time-consuming clustering algorithm, such as local search or Lloyd's algorithm, on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results with experiments that compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.
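
The sample-then-cluster idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the sampling procedure, so the sketch assumes plain uniform sampling, simulates machines as in-memory lists, and substitutes the farthest-point greedy heuristic for k-center as the clustering step run on the sample. All function names (`map_sample`, `greedy_k_center`, `k_center_cost`, `sample_then_cluster`) are hypothetical.

```python
import math
import random


def map_sample(partition, rate, rng):
    """Map step (assumed): keep each point independently with probability `rate`."""
    return [p for p in partition if rng.random() < rate]


def greedy_k_center(points, k):
    """Farthest-point greedy on a single machine.

    Starts from the first point and repeatedly adds the point farthest
    from the centers chosen so far.
    """
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        # Update each point's distance to its nearest chosen center.
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def k_center_cost(points, centers):
    """k-center objective: the largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def sample_then_cluster(partitions, k, rate, seed=0):
    """Sample on every simulated machine, gather the sample, cluster it once."""
    rng = random.Random(seed)
    sample = [p for part in partitions for p in map_sample(part, rate, rng)]
    if not sample:
        # Fallback for tiny inputs: use the first available point.
        sample = [next(p for part in partitions for p in part)]
    return greedy_k_center(sample, k)


if __name__ == "__main__":
    rng = random.Random(1)
    blobs = [(0, 0), (100, 0), (0, 100)]
    partitions = [
        [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)) for _ in range(300)]
        for cx, cy in blobs
    ]
    centers = sample_then_cluster(partitions, k=3, rate=0.1)
    everything = [p for part in partitions for p in part]
    print(len(centers), k_center_cost(everything, centers))
```

The expensive step only ever sees the sample, so its running time depends on the sample size rather than on the full input. The centers are then scored against all points, which is how the quality of a sampled solution would be judged in practice.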