Counting triangles and the curse of the last reducer

Authors:
Siddharth Suri;Sergei Vassilvitskii
Affiliations:
Yahoo! Research, New York, NY, USA;Yahoo! Research, New York, NY, USA
Venue:
Proceedings of the 20th international conference on World wide web
Year:
2011

Abstract

The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly knit the community is around the node. Its computation can be reduced to counting the number of triangles incident on the particular node in the network. When the graph is too big to fit into memory, this is a non-trivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles, which we use to compute clustering coefficients. Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a speedup of a factor of 10-100 over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real-world datasets comprising millions of nodes and over a billion edges. Our results show that both algorithms effectively deal with skew in the degree distribution and lead to dramatic speedups over the naive implementation.
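To make the reduction from clustering coefficients to triangle counting concrete, here is a minimal single-machine sketch in Python. It is not the authors' MapReduce implementation, and the abstract does not specify their sequential algorithm. The sketch assumes one common approach: each triangle is charged to its lowest-degree vertex, so that high-degree nodes do not enumerate all pairs of their neighbors. The function names `triangles_per_node` and `clustering_coefficients` are hypothetical, and nodes are assumed to be mutually comparable (for example, integers) so that degree ties can be broken.

```python
from collections import defaultdict
from itertools import combinations


def triangles_per_node(edges):
    """Return (adjacency, counts): counts[v] is the number of triangles at v.

    Assumption for illustration: an undirected simple graph given as
    (u, v) pairs; self-loops are ignored and duplicates are collapsed.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Order vertices by (degree, id); each triangle is found exactly once,
    # at its lowest-ranked vertex. This limits the work at high-degree nodes.
    rank = {v: (len(adj[v]), v) for v in adj}
    counts = {v: 0 for v in adj}

    for v in adj:
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:  # the edge (u, w) closes the triangle v-u-w
                counts[v] += 1
                counts[u] += 1
                counts[w] += 1
    return adj, counts


def clustering_coefficients(edges):
    """Fraction of each node's neighbor pairs that are themselves connected."""
    adj, counts = triangles_per_node(edges)
    cc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        # A node with degree d has d * (d - 1) / 2 neighbor pairs.
        cc[v] = 0.0 if d < 2 else 2.0 * counts[v] / (d * (d - 1))
    return cc
```

For the edge list `[(1, 2), (2, 3), (1, 3), (3, 4)]`, nodes 1 and 2 each get a coefficient of 1.0, node 3 gets 1/3 (one of its three neighbor pairs is connected), and node 4 gets 0.0.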