Speedup Versus Efficiency in Parallel Systems
IEEE Transactions on Computers
What are race conditions?: Some issues and formalizations
ACM Letters on Programming Languages and Systems (LOPLAS)
Programming parallel algorithms
Communications of the ACM
Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Space-Efficient Scheduling of Multithreaded Computations
SIAM Journal on Computing
Thread scheduling for multiprogrammed multiprocessors
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Geometric Mesh Partitioning: Implementation and Experiments
SIAM Journal on Scientific Computing
The Parallel Evaluation of General Arithmetic Expressions
Journal of the ACM (JACM)
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
DNA electrophoresis studied with the cage model
Journal of Computational Physics
Protocol Verification as a Hardware Design Aid
ICCD '92 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Reducers and other Cilk++ hyperobjects
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Introduction to Algorithms, Third Edition
Introduction to Algorithms, Third Edition
Large-scale parallel breadth-first search
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 3
The Cilk++ concurrency platform
The Journal of Supercomputing
Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication
PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases
The Cilkview scalability analyzer
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Parallel computation of the minimal elements of a poset
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Ordered and unordered algorithms for parallel breadth first search
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Parallel breadth-first search on distributed memory systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Massively parallel breadth first search using a tree-structured memory model
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Internally deterministic parallel algorithms can be fast
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Pervasive and Mobile Computing
Elixir: a system for synthesizing concurrent graph programs
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Ligra: a lightweight graph processing framework for shared memory
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
An efficient programming model for memory-intensive recursive algorithms using parallel disks
Proceedings of the 37th International Symposium on Symbolic and Algebraic Computation
GPUDet: a deterministic GPU architecture
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Understanding parallelism in graph traversal on multi-core clusters
Computer Science - Research and Development
Parallel graph decompositions using random shifts
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Deterministic galois: on-demand, portable and parameterless
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standar. C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of PBFS algorithm runs it time O((V+E)/P + Dlg3(V/D)) on P processors, which means that it attains near-perfect linear speedup if P V+E)/Dlg3(V/D).