PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations

Authors:
U. Kang;Charalampos E. Tsourakakis;Christos Faloutsos
Venue:
ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Year:
2009

Abstract

In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several gigabytes, terabytes, or petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) performance more than 5 times faster than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with 6.7 billion edges.
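
The idea that several graph mining tasks reduce to a repeated, generalized matrix-vector multiplication can be illustrated with a small single-machine sketch. This is not the PEGASUS implementation and does not use Hadoop; the function names (`gim_v`, `connected_components`) and the three pluggable operations (`combine2`, `combine_all`, `assign`) are assumptions chosen for illustration. Ordinary matrix-vector multiplication is recovered with multiply, sum, and overwrite; swapping in "take the neighbor's label", "minimum", and "keep the smaller label" gives an iterated connected-components computation.

```python
from collections import defaultdict


def gim_v(edges, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step over a sparse edge list.

    edges:  iterable of (i, j, m_ij) for the nonzero matrix entries.
    vector: dict mapping node id -> current value v_j.
    The three callables are hypothetical hooks, not a real library API.
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Pair each nonzero entry m_ij with the matching vector element v_j.
        partial[i].append(combine2(m_ij, vector[j]))
    # Reduce the partial results of each row, then merge with the old value.
    return {
        i: assign(v_i, combine_all(i, partial.get(i, [])))
        for i, v_i in vector.items()
    }


def connected_components(undirected_edges, nodes):
    """Iterate gim_v until the component labels stop changing."""
    # Store each undirected edge in both directions as a matrix entry of 1.
    edges = [(i, j, 1) for a, b in undirected_edges for i, j in ((a, b), (b, a))]
    labels = {n: n for n in nodes}  # every node starts in its own component
    while True:
        new_labels = gim_v(
            edges,
            labels,
            combine2=lambda m_ij, v_j: v_j,                # pass neighbor's label
            combine_all=lambda i, xs: min(xs, default=i),  # smallest neighbor label
            assign=min,                                    # keep the smaller label
        )
        if new_labels == labels:
            return labels
        labels = new_labels


# Plain matrix-vector product as a special case of the same primitive.
matvec = gim_v(
    [(0, 0, 2), (0, 1, 1), (1, 1, 3)],
    {0: 1, 1: 2},
    combine2=lambda m_ij, v_j: m_ij * v_j,
    combine_all=lambda i, xs: sum(xs),
    assign=lambda old, new: new,
)

components = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5, 6])
```

Each pass over the edge list touches every edge once, which is consistent with the linear-in-edges running time the abstract reports; the number of passes needed for connected components depends on the graph.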