CC-MR --- finding connected components in huge graphs with mapreduce

Authors:
Thomas Seidl;Brigitte Boden;Sergej Fries
Affiliations:
Data Management and Data Exploration Group, RWTH Aachen University, Germany;Data Management and Data Exploration Group, RWTH Aachen University, Germany;Data Management and Data Exploration Group, RWTH Aachen University, Germany
Venue:
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Year:
2012

Citing 10
Cited 0

An O(n2 log n) parallel max-flow algorithm

Journal of Algorithms
A comparison of parallel algorithms for connected components

SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Efficient parallel algorithms for some graph problems

Communications of the ACM
Computing connected components on parallel computers

Communications of the ACM
A Parallel Algorithm for Connected Components on Distributed Memory Machines

Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Graph Twiddling in a MapReduce World

Computing in Science and Engineering
PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Cloud-based Connected Component Algorithm

AICI '10 Proceedings of the 2010 International Conference on Artificial Intelligence and Computational Intelligence - Volume 03
Filtering: a method for solving graph problems in MapReduce

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Quantified Score

Hi-index	0.00

Visualization

Abstract

The detection of connected components in graphs is a well-known problem arising in a large number of applications including data mining, analysis of social networks, image analysis and a lot of other related problems. In spite of the existing very efficient serial algorithms, this problem remains a subject of research due to increasing data amounts produced by modern information systems which cannot be handled by single workstations. Only highly parallelized approaches on multi-core-servers or computer clusters are able to deal with these large-scale data sets. In this work we present a solution for this problem for distributed memory architectures, and provide an implementation for the well-known MapReduce framework developed by Google. Our algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as we show in our experimental evaluation on synthetic and real-world data. Furthermore, we present a technique for accelerating our implementation for datasets with very heterogeneous component sizes as they often appear in real data sets.