Mining large graphs: Algorithms, inference, and discoveries

Authors:
U Kang;Duen Horng Chau;Christos Faloutsos
Affiliations:
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh PA 15213, United States (all three authors)
Venue:
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Year:
2011

Citing 0
Cited 5

Spectral analysis for billion-scale graphs: discoveries and implementation
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II

Unifying guilt-by-association approaches: theorems and fast algorithms
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II

GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Efficient ad-hoc search for personalized PageRank
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Horton+: a distributed system for processing declarative reachability queries over partitioned graphs
Proceedings of the VLDB Endowment

Abstract

How do we find patterns and anomalies in graphs with billions of nodes and edges that do not fit in memory? How can we use parallelism for such terabyte-scale graphs? In this work, we focus on inference, which often corresponds, intuitively, to "guilt by association" scenarios. For example, if a person is a drug abuser, his or her friends probably are too; if a node in a social network is male, his dates are probably female. We show how to do inference on such huge graphs through our proposed HAdoop Line graph Fixed Point (Ha-Lfp), an efficient parallel algorithm for sparse billion-scale graphs, using the Hadoop platform. Our contributions include (a) the design of Ha-Lfp, observing that it corresponds to a fixed point on a line graph induced from the original graph; (b) scalability analysis, showing that our algorithm scales up well with the number of edges, as well as with the number of machines; and (c) experimental results on two private graphs, as well as two of the largest publicly available graphs: the Web Graphs from Yahoo! (6.6 billion edges and 0.24 terabytes) and the Twitter graph (3.7 billion edges and 0.13 terabytes). We evaluated our algorithm using M45, one of the top 50 fastest supercomputers in the world, and we report patterns and anomalies discovered by our algorithm that would otherwise be invisible.
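
To make the "fixed point on a line graph" idea concrete, here is a minimal single-machine sketch. It is not the authors' Ha-Lfp implementation: the abstract does not state the update rule, so the sketch assumes a belief-propagation-style message update with two states per node. The function names (`line_graph`, `fixed_point`), the `same` coupling parameter, and the toy graph are all hypothetical and chosen only for illustration. Each directed edge of the original graph becomes one node of the line graph, and the message on edge (u, v) is recomputed from the messages on edges (w, u) with w != v until nothing changes.

```python
from collections import defaultdict


def line_graph(edges):
    """Build the directed line graph of an undirected edge list.

    Every undirected edge {u, v} yields two line-graph nodes, (u, v) and (v, u).
    Line-graph node (w, u) feeds (u, v) whenever w != v.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    incoming = {}
    for u in nbrs:
        for v in nbrs[u]:
            incoming[(u, v)] = [(w, u) for w in sorted(nbrs[u]) if w != v]
    return nbrs, incoming


def fixed_point(edges, priors, same=0.9, iters=50, tol=1e-9):
    """Iterate messages on the line graph until they stop changing.

    ASSUMPTION: a belief-propagation-style update with two states per node.
    `same` is the assumed probability that neighbors share a state
    (values above 0.5 model "guilt by association").
    """
    nbrs, incoming = line_graph(edges)
    psi = [[same, 1 - same], [1 - same, same]]   # edge compatibility table
    msg = {e: [0.5, 0.5] for e in incoming}      # uninformative start
    for _ in range(iters):
        new = {}
        for (u, v), preds in incoming.items():
            phi = priors.get(u, [0.5, 0.5])
            out = [0.0, 0.0]
            for xv in range(2):
                for xu in range(2):
                    term = phi[xu] * psi[xu][xv]
                    for p in preds:              # messages into u, except from v
                        term *= msg[p][xu]
                    out[xv] += term
            z = out[0] + out[1]
            new[(u, v)] = [out[0] / z, out[1] / z]
        delta = max((abs(new[e][0] - msg[e][0]) for e in msg), default=0.0)
        msg = new
        if delta < tol:                          # fixed point reached
            break
    beliefs = {}
    for v in nbrs:
        b = list(priors.get(v, [0.5, 0.5]))
        for w in nbrs[v]:
            for x in range(2):
                b[x] *= msg[(w, v)][x]
        z = b[0] + b[1]
        beliefs[v] = [b[0] / z, b[1] / z]
    return beliefs


# Toy path graph 0 - 1 - 2 where only node 0 has an informative prior.
beliefs = fixed_point([(0, 1), (1, 2)], {0: [0.9, 0.1]})
```

On this toy path graph the belief in state 0 decays with distance from the labeled node while staying above one half. Setting `same` below 0.5 would instead model the abstract's second example, where neighbors tend to take opposite states. In a Hadoop setting, each pass of the outer loop would presumably map to one distributed job over the line-graph edges, but that mapping is an assumption here, not something the abstract spells out.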