Scalable GPU graph traversal

Authors:
Duane Merrill;Michael Garland;Andrew Grimshaw
Affiliations:
University of Virginia, Charlottesville, VA, USA;NVIDIA Corporation, Santa Clara, CA, USA;University of Virginia, Charlottesville, VA, USA
Venue:
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Year:
2012

Citing 19

Data parallel algorithms
Communications of the ACM - Special issue on parallelism

Scans as Primitive Parallel Operations
IEEE Transactions on Computers

High-probability parallel transitive closure algorithms
SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures

Scan primitives for vector computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing

Introduction to Algorithms

On the Architectural Requirements for Efficient Execution of Graph Algorithms
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing

A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing

Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing

Fast scan algorithms on graphics processors
Proceedings of the 22nd annual international conference on Supercomputing

Sparse matrix computations on manycore GPU's
Proceedings of the 45th annual Design Automation Conference

Efficient Breadth-First Search on the Cell/BE Processor
IEEE Transactions on Parallel and Distributed Systems

Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)

Taming irregular EDA applications on GPUs
Proceedings of the 2009 International Conference on Computer-Aided Design

Accelerating large graph algorithms on the GPU using CUDA
HiPC'07 Proceedings of the 14th international conference on High performance computing

A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures

An effective GPU implementation of breadth-first search
Proceedings of the 47th Design Automation Conference

Scalable Graph Exploration on Multicore Processors
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Accelerating CUDA graph algorithms at maximum warp
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming

Cited 24

A GPU implementation of inclusion-based points-to analysis
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

On parallel software verification using boolean equation systems
SPIN'12 Proceedings of the 19th international conference on Model Checking Software

Nested data-parallelism on the gpu
Proceedings of the 17th ACM SIGPLAN international conference on Functional programming

A yoke of oxen and a thousand chickens for heavy lifting graph processing
Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Direction-optimizing breadth-first search
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Ligra: a lightweight graph processing framework for shared memory
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Morph algorithms on GPUs
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Parallel schedule synthesis for attribute grammars
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

StreamScan: fast scan algorithms for GPUs without global barrier synchronization
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Cache-Conscious Wavefront Scheduling
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Betweenness centrality on GPUs and heterogeneous architectures
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Atomic-free irregular computations on GPUs
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

SemCache: semantics-aware caching for efficient GPU offloading
Proceedings of the 27th international ACM conference on International conference on supercomputing

General transformations for GPU execution of tree traversals
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

On fast parallel detection of strongly connected components (SCC) in small-world graphs
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Trellis: Portability across architectures with a high-level framework
Journal of Parallel and Distributed Computing

The energy case for graph processing on hybrid CPU and GPU systems
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms

Energy efficient GPU transactional memory via space-time optimizations
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems

CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations
Parallel Computing

Efficient decomposition of strongly connected components on GPUs
Journal of Systems Architecture: the EUROMICRO Journal

Load balanced clustering coefficients
Proceedings of the first workshop on Parallel programming for analytics applications

Benchmarking graph-processing platforms: a vision
Proceedings of the 5th ACM/SPEC international conference on Performance engineering

Direction-optimizing breadth-first search
Scientific Programming - Selected Papers from Super Computing 2012

Abstract

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management, built from efficient prefix sums, that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
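
To illustrate the idea in the abstract, the following is a minimal sequential sketch in Python of a level-synchronous BFS in which prefix sums assign output positions to each unit of work. It is not the authors' GPU implementation: the compressed sparse row (CSR) layout, the function names `exclusive_scan` and `bfs_depths`, and the two-phase expand/compact structure are assumptions chosen for illustration. Each loop over the frontier stands in for what would be independent parallel threads, and the scan is what lets them write without colliding.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i]). Also returns the total."""
    if not values:
        return [], 0
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1], inclusive[-1]


def bfs_depths(row_offsets, column_indices, source):
    """BFS depth of every vertex of a CSR graph (-1 if unreachable).

    Hypothetical helper: row_offsets has |V|+1 entries, and the neighbors
    of vertex v are column_indices[row_offsets[v]:row_offsets[v + 1]].
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        # Expand: a scan over frontier degrees gives each vertex the offset
        # at which it writes its neighbors into the edge frontier.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        offsets, total = exclusive_scan(degrees)
        edge_frontier = [0] * total
        for i, v in enumerate(frontier):
            for k in range(degrees[i]):
                edge_frontier[offsets[i] + k] = column_indices[row_offsets[v] + k]
        # Filter: flag neighbors seen for the first time and label them.
        flags = []
        for n in edge_frontier:
            if depth[n] == -1:
                depth[n] = level
                flags.append(1)
            else:
                flags.append(0)
        # Compact: a second scan gives each kept neighbor its slot in the
        # next vertex frontier.
        positions, kept = exclusive_scan(flags)
        next_frontier = [0] * kept
        for i, n in enumerate(edge_frontier):
            if flags[i]:
                next_frontier[positions[i]] = n
        frontier = next_frontier
    return depth


# Small directed example: 0->1, 0->2, 1->3, 2->3, 3->4
row_offsets = [0, 2, 3, 4, 5, 5]
column_indices = [1, 2, 3, 3, 4]
depths = bfs_depths(row_offsets, column_indices, 0)
```

In this sketch every vertex enters a frontier at most once and every edge is examined once, which is where the O(|V|+|E|) work bound comes from. A parallel version would also need a rule for threads that discover the same vertex at the same time, which this sequential sketch does not model.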