The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly knit the community around the node is. Computing it reduces to counting the triangles incident on that node. When the graph is too big to fit into memory, this is a non-trivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles, which we use to compute clustering coefficients. Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a factor of 10-100 speedup over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real-world datasets comprising millions of nodes and over a billion edges. Our results show that both algorithms deal effectively with skew in the degree distribution and lead to dramatic speedups over the naive implementation.
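To make the reduction from clustering coefficients to triangle counting concrete, here is a minimal sequential sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the graph is assumed to be undirected and small enough to fit in memory, and the local clustering coefficient is taken to be the standard 2T(v) / (d(v)(d(v) - 1)), where T(v) is the number of triangles incident on v and d(v) is its degree. Each triangle is charged to its lowest-degree corner, one common way to limit the work done at high-degree nodes.

```python
from collections import defaultdict
from itertools import combinations


def local_triangle_counts(edges):
    """Return (triangles, adj) for an undirected graph given as (u, v) pairs.

    triangles[n] is the number of triangles incident on node n.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    adj = dict(adj)

    # Total order on nodes: by degree, ties broken by node id
    # (assumes node ids are mutually comparable, e.g. all ints).
    def rank(n):
        return (len(adj[n]), n)

    triangles = {n: 0 for n in adj}
    for v in adj:
        # Keep only neighbors ranked above v, so that v is the
        # lowest-ranked corner of every triangle found here and each
        # triangle is discovered exactly once.
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                for n in (v, u, w):
                    triangles[n] += 1
    return triangles, adj


def clustering_coefficients(edges):
    """Return {node: local clustering coefficient} (0.0 when degree < 2)."""
    triangles, adj = local_triangle_counts(edges)
    coefficients = {}
    for n, neighbors in adj.items():
        d = len(neighbors)
        coefficients[n] = 2 * triangles[n] / (d * (d - 1)) if d >= 2 else 0.0
    return coefficients


if __name__ == "__main__":
    # One triangle {1, 2, 3} plus a pendant edge (3, 4).
    print(clustering_coefficients([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

In this sketch, a node whose neighbors are all connected to each other gets coefficient 1.0, and a node of degree below 2 is assigned 0.0 by convention. A distributed variant would need to split the adjacency data across machines, which this sketch does not attempt.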