GraphBuilder: scalable graph ETL framework

Authors:
Nilesh Jain; Guangdeng Liao; Theodore L. Willke
Affiliations:
Systems Architecture Lab, Intel Corporation, Hillsboro, OR; Systems Architecture Lab, Intel Corporation, Hillsboro, OR; Systems Architecture Lab, Intel Corporation, Hillsboro, OR
Venue:
First International Workshop on Graph Data Management Experiences and Systems
Year:
2013

Abstract

Graph abstraction is essential to many applications, from finding a shortest path to running complex machine learning (ML) algorithms such as collaborative filtering. Constructing graphs from raw data is becoming increasingly challenging because of exponential data growth and the need for large-scale graph processing. Because graph construction is a data-parallel problem, MapReduce is well suited to the task. We developed GraphBuilder, a scalable framework for graph Extract-Transform-Load (ETL), to offload many of the complexities of graph construction, including graph formation, tabulation, transformation, partitioning, output formatting, and serialization. GraphBuilder is written in Java for ease of programming, and it scales using the MapReduce model. In this paper, we describe the motivation for GraphBuilder, its architecture, its MapReduce algorithms, and a performance evaluation of the framework. Because large graphs need to be partitioned across a cluster for storage and processing, and because the choice of partitioning method has a significant impact on performance, we develop several graph partitioning methods and evaluate their performance. We have also released the framework as open source at https://01.org/graphbuilder/.
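
To make the data-parallel framing concrete, the following is a minimal single-process sketch in plain Java of a map step, a reduce step, and a simple partitioning step. It is not GraphBuilder's code or API, and it does not use Hadoop. The class name GraphEtlSketch, its method names, the tab-separated "src dst" input format, and the hash-based edge placement are all assumptions made for illustration; the abstract does not say which partitioning methods the paper develops.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from GraphBuilder.
class GraphEtlSketch {

    // "Map" step: parse one raw record of the assumed form "src<TAB>dst"
    // into an edge key "src->dst". Malformed records and self-loops yield null.
    static String mapRecord(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 2 || fields[0].equals(fields[1])) {
            return null;
        }
        return fields[0] + "->" + fields[1];
    }

    // "Reduce" step: group identical edge keys and count them, which removes
    // duplicate edges and records how often each edge occurred in the input.
    static Map<String, Integer> buildEdgeCounts(List<String> rawLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : rawLines) {
            String edge = mapRecord(line);
            if (edge != null) {
                counts.merge(edge, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Partitioning step (assumed scheme): place each distinct edge in one of
    // k partitions by hashing its key. Every edge lands in exactly one partition.
    static List<List<String>> partitionEdges(Map<String, Integer> edges, int k) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            parts.add(new ArrayList<>());
        }
        for (String edge : edges.keySet()) {
            parts.get(Math.floorMod(edge.hashCode(), k)).add(edge);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Toy input: one duplicate edge, one self-loop, one malformed record.
        List<String> raw = Arrays.asList("a\tb", "a\tb", "b\tc", "c\tc", "bad");
        Map<String, Integer> edges = buildEdgeCounts(raw);
        System.out.println("edges: " + edges);
        System.out.println("partitions: " + partitionEdges(edges, 3));
    }
}
```

In a real MapReduce job the grouping by edge key would be done by the framework's shuffle phase rather than by an in-memory map, and the partition assignment would typically decide which output file or machine receives each edge.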