A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)

Authors:
Charles E. Leiserson;Tao B. Schardl
Affiliations:
MIT CSAIL, Cambridge, MA, USA;MIT CSAIL, Cambridge, MA, USA
Venue:
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Year:
2010


Abstract

We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices (a condition met by many real-world graphs), PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" (a "hyperobject" feature of Cilk++), the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races, which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph $G=(V,E)$ with diameter $D$ and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time $O((V+E)/P + D\lg^3(V/D))$ on $P$ processors, which means that it attains near-perfect linear speedup if $P \ll (V+E)/(D\lg^3(V/D))$.
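The idea of replacing the FIFO queue with an unordered collection can be illustrated with a minimal serial sketch. It is not the paper's implementation: a plain `std::vector` stands in for the bag, the loop over each layer runs serially rather than in parallel, and no reducer is involved. The names `Graph` and `layered_bfs` are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Adjacency-list graph: adj[u] lists the out-neighbors of vertex u.
// (Hypothetical type alias for this sketch.)
using Graph = std::vector<std::vector<int>>;

// Layer-by-layer BFS. Each layer is held in an unordered container
// (a std::vector standing in for a "bag"). No FIFO order is needed,
// because every vertex in a layer is at the same distance from the source.
// Returns dist[v] = number of edges on a shortest path from source to v,
// or -1 if v is unreachable.
std::vector<int> layered_bfs(const Graph& adj, int source) {
    assert(source >= 0 && static_cast<std::size_t>(source) < adj.size());

    std::vector<int> dist(adj.size(), -1);  // -1 marks "not yet visited"
    std::vector<int> frontier{source};      // layer 0
    dist[source] = 0;

    for (int d = 0; !frontier.empty(); ++d) {
        std::vector<int> next;  // collects layer d + 1 in arbitrary order

        // The order in which frontier vertices are processed does not
        // change the distances computed. A parallel version would process
        // this loop concurrently; here it is serial.
        for (int u : frontier) {
            for (int v : adj[u]) {
                if (dist[v] == -1) {
                    dist[v] = d + 1;
                    next.push_back(v);
                }
            }
        }
        frontier = std::move(next);
    }
    return dist;
}
```

Calling `layered_bfs(adj, 0)` on a graph with edges 0→1, 0→2, 1→3, 2→3, 3→4 assigns distances 0, 1, 1, 2, 3 to vertices 0 through 4. The outer loop runs once per layer, which suggests why the diameter $D$ appears as a multiplicative factor in the second term of the running-time bound.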