Spectral analysis for billion-scale graphs: discoveries and implementation

Authors:
U. Kang; Brendan Meeder; Christos Faloutsos
Affiliations:
Carnegie Mellon University, School of Computer Science (all three authors)
Venue:
PAKDD '11: Proceedings of the 15th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Volume Part II
Year:
2011

Abstract

Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) on large sparse matrices, let alone on billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we carefully design to be accurate, efficient, and able to run in the highly scalable MapReduce (Hadoop) environment. This enables HEIGEN to handle matrices more than 1000× larger than those that existing algorithms can analyze. We implement HEIGEN and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-world graphs, including a snapshot of the Twitter social network (38 GB, 2 billion edges) and the "YahooWeb" dataset, one of the largest publicly available graphs (120 GB, 1.4 billion nodes, 6.6 billion edges).
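To make the link between eigenpairs and triangles concrete, the sketch below uses two standard identities for an undirected graph with adjacency matrix A: the total number of triangles equals (1/6) Σ_j λ_j³, and the number of triangles through node i equals (1/2) Σ_j λ_j³ u_j[i]², where (λ_j, u_j) are the eigenpairs of A. This is not the HEIGEN algorithm and not a MapReduce program. It is a single-machine illustration that swaps in plain power iteration with deflation on a five-node toy graph, and the helper names (`top_eigenpairs`, `total_triangles`, `local_triangles`, `exact_triangles`) are hypothetical, chosen only for this example.

```python
import math
import random
from itertools import combinations

# Toy undirected graph (an assumption for illustration): a 4-clique on
# nodes 0..3 plus node 4 attached to nodes 0 and 1.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (1, 4)]
N = 5
ADJ = [set() for _ in range(N)]
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(adj, x):
    # y = A x, with A stored as a list of neighbour sets.
    return [sum(x[j] for j in nbrs) for nbrs in adj]


def deflate_and_normalize(v, pairs):
    # Remove components along eigenvectors already found, then rescale.
    for _, u in pairs:
        c = dot(u, v)
        v = [vi - c * ui for vi, ui in zip(v, u)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]


def top_eigenpairs(adj, k, iters=500, seed=0):
    # Power iteration with deflation: returns k eigenpairs in decreasing
    # order of |eigenvalue|. Assumes those k eigenvalues are nonzero and
    # that no two of them have equal magnitude with opposite signs.
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(len(adj))]
        for _ in range(iters):
            v = matvec(adj, deflate_and_normalize(v, pairs))
        v = deflate_and_normalize(v, pairs)
        lam = dot(v, matvec(adj, v))  # Rayleigh quotient
        pairs.append((lam, v))
    return pairs


def total_triangles(pairs):
    # (1/6) * sum of cubed eigenvalues; exact when all eigenpairs are used.
    return sum(lam ** 3 for lam, _ in pairs) / 6.0


def local_triangles(pairs, n):
    # Triangles through node i: (1/2) * sum_j lam_j^3 * u_j[i]^2.
    return [sum(lam ** 3 * u[i] ** 2 for lam, u in pairs) / 2.0
            for i in range(n)]


def exact_triangles(adj):
    # Brute-force reference count, feasible only for tiny graphs.
    return sum(1 for i, j, k in combinations(range(len(adj)), 3)
               if j in adj[i] and k in adj[i] and k in adj[j])


full = top_eigenpairs(ADJ, N)   # all eigenpairs: reproduces the exact count
few = top_eigenpairs(ADJ, 3)    # only the 3 largest-magnitude eigenpairs
print("exact:", exact_triangles(ADJ))
print("full spectrum:", total_triangles(full))
print("top-3 estimate:", total_triangles(few))
print("per-node (full spectrum):", local_triangles(full, N))
```

Because the eigenvalues are cubed, the largest-magnitude ones dominate the sum, which is why a handful of eigenpairs can already give a usable triangle estimate. On a billion-scale graph the dense loops above are not an option, which is the gap a distributed eigensolver is meant to fill.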