Of hammers and nails: an empirical comparison of three paradigms for processing large graphs

Authors:
Marc Najork;Dennis Fetterly;Alan Halverson;Krishnaram Kenthapadi;Sreenivas Gollapudi
Affiliations:
Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA;Microsoft Research, Madison, WI, USA;Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the fifth ACM international conference on Web search and data mining
Year:
2012

Abstract

Many phenomena and artifacts, such as road networks, social networks, and the web, can be modeled as large graphs and analyzed using graph algorithms. However, given the size of the underlying graphs, efficient implementation of basic operations such as connected component analysis, approximate shortest paths, and link-based ranking (e.g., PageRank) becomes challenging. This paper presents an empirical study of computations on such large graphs in three well-studied platform models: a relational model, a data-parallel model, and a special-purpose in-memory model. We choose a prototypical member of each platform model and analyze its computational efficiency and requirements for five basic graph operations used in the analysis of real-world graphs: PageRank, SALSA, Strongly Connected Components (SCC), Weakly Connected Components (WCC), and Approximate Shortest Paths (ASP). Further, we characterize each platform in terms of these computations using model-specific implementations of these algorithms on a large web graph. Our experiments show that no single platform performs best across different classes of operations on large graphs. While relational databases are powerful and flexible tools that support a wide variety of computations, some computations benefit from special-purpose storage systems, and others can exploit data-parallel platforms.
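
To make two of the operations named in the abstract concrete, here is a minimal single-machine, in-memory sketch of Weakly Connected Components and PageRank in Python. It is not taken from the paper and does not reflect any of the three platforms studied. The function names, the edge-list input format, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group vertices into weakly connected components using union-find.

    `edges` is an iterable of (src, dst) pairs (an assumed input format).
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        # Path halving keeps the union-find trees shallow.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        # Edge direction is ignored: WCC treats the graph as undirected.
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    components = defaultdict(set)
    for v in list(parent):
        components[find(v)].add(v)
    return sorted(sorted(c) for c in components.values())


def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over an edge list."""
    out_links = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_links[src].append(dst)
        vertices.update((src, dst))

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in vertices}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]  # hypothetical toy graph
    print(weakly_connected_components(toy_edges))
    print(pagerank(toy_edges))
```

Both functions hold the whole graph in one process's memory, which is only workable for small inputs. At the scale of a large web graph, the choice of platform becomes the central question, as the abstract describes.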