Parallel breadth-first search on distributed memory systems

Authors:
Aydin Buluç;Kamesh Madduri
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
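
To make the level-synchronous, vertex-partitioned strategy concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the contiguous block partition, the function names `owner` and `bfs_1d`, and the in-memory "outbox" that stands in for an all-to-all exchange between ranks are all hypothetical choices made for this example.

```python
def owner(v, n, p):
    """Rank that owns vertex v under a contiguous block partition (assumed layout)."""
    block = (n + p - 1) // p  # ceiling of n / p
    return v // block


def bfs_1d(adj, source, p):
    """Simulate level-synchronous BFS over p ranks inside one process.

    adj: list of neighbor lists for a graph with vertices 0..n-1.
    Returns (level, parent); -1 marks vertices not reached from source.
    """
    n = len(adj)
    level = [-1] * n
    parent = [-1] * n
    level[source] = 0
    parent[source] = source

    # frontier[r] holds the current-level vertices owned by rank r.
    frontier = [[] for _ in range(p)]
    frontier[owner(source, n, p)].append(source)

    depth = 0
    while any(frontier):
        # Step 1: each rank scans the edges of its local frontier and
        # buckets (target, parent) pairs by the rank that owns the target.
        outbox = [[[] for _ in range(p)] for _ in range(p)]
        for r in range(p):
            for u in frontier[r]:
                for v in adj[u]:
                    outbox[r][owner(v, n, p)].append((v, u))

        # Step 2: simulated all-to-all exchange. Each owner inspects the
        # pairs sent to it and claims vertices it has not seen before.
        depth += 1
        next_frontier = [[] for _ in range(p)]
        for dst in range(p):
            for src in range(p):
                for v, u in outbox[src][dst]:
                    if level[v] == -1:
                        level[v] = depth
                        parent[v] = u
                        next_frontier[dst].append(v)

        # Step 3: the loop boundary acts as the per-level synchronization.
        frontier = next_frontier

    return level, parent
```

In this sketch every frontier edge produces one message to the owner of its endpoint, so the exchange volume per level grows with the number of edges leaving the frontier. That per-level all-to-all traffic is the communication overhead the abstract says the two-dimensional partitioning is designed to mitigate.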