Distributed GraphLab: a framework for machine learning and data mining in the cloud

Authors:
Yucheng Low;Danny Bickson;Joseph Gonzalez;Carlos Guestrin;Aapo Kyrola;Joseph M. Hellerstein
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;UC Berkeley
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

While high-level data-parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction, which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show one to two orders of magnitude performance gains over Hadoop-based implementations.
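To make the snapshot idea in the abstract concrete, the following is a minimal single-process sketch of a Chandy-Lamport-style snapshot written as a vertex update that reschedules its neighbors. It is an illustration under stated assumptions, not GraphLab's API: the function name `snapshot`, the dictionary-based graph layout, and the FIFO scheduler are all hypothetical, and the loop runs sequentially, whereas the real system would run such updates in parallel under its consistency guarantees.

```python
from collections import deque


def snapshot(vertex_data, edge_data, neighbors, start):
    """Simulate a snapshot that spreads from `start` across an undirected graph.

    Hypothetical layout (an assumption for this sketch):
      vertex_data: dict mapping vertex -> value
      edge_data:   dict mapping frozenset({u, v}) -> value
      neighbors:   dict mapping vertex -> list of adjacent vertices
    Returns the saved vertex values and saved edge values.
    """
    saved_vertices = {}
    saved_edges = {}
    snapshotted = set()
    schedule = deque([start])  # stand-in for a scheduler's task queue

    while schedule:
        v = schedule.popleft()
        if v in snapshotted:
            continue  # a vertex is saved at most once
        saved_vertices[v] = vertex_data[v]
        for u in neighbors[v]:
            if u not in snapshotted:
                # The first endpoint to be snapshotted saves the shared edge,
                # then asks the other endpoint to snapshot itself.
                key = frozenset((u, v))
                saved_edges[key] = edge_data[key]
                schedule.append(u)
        snapshotted.add(v)

    return saved_vertices, saved_edges


# Toy path graph a - b - c.
vertex_data = {"a": 1, "b": 2, "c": 3}
edge_data = {frozenset(("a", "b")): 10, frozenset(("b", "c")): 20}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

saved_v, saved_e = snapshot(vertex_data, edge_data, neighbors, "a")
print(sorted(saved_v), len(saved_e))
```

Because each update touches only a vertex, its adjacent edges, and its neighbors' snapshot flags, the snapshot fits the same vertex-centric pattern as ordinary computation, which is the sense in which it can be expressed within the abstraction itself.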