Efficient breadth first search on multi-GPU systems

Authors:
Enrico Mastrostefano;Massimo Bernaschi
Affiliations:
-;-
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Citing 16
Cited 0

High probability parallel transitive-closure algorithms

SIAM Journal on Computing
An introduction to parallel algorithms

An introduction to parallel algorithms
Programming parallel algorithms

Communications of the ACM
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Kronecker Graphs: An Approach to Modeling Networks

The Journal of Machine Learning Research
Accelerating large graph algorithms on the GPU using CUDA

HiPC'07 Proceedings of the 14th international conference on High performance computing
Parallel graph component labelling with GPUs and CUDA

Parallel Computing
Scalable Graph Exploration on Multicore Processors

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Graph partitioning strategies for efficient BFS in shared-nothing parallel systems

WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Parallel breadth-first search on distributed memory systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Highly scalable graph search for the Graph500 benchmark

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
2D Partitioning Based Graph Search for the Graph500 Benchmark

IPDPSW '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Understanding parallelism in graph traversal on multi-core clusters

Computer Science - Research and Development

Quantified Score

Hi-index	0.00

Visualization

Abstract

Simple algorithms for the execution of a Breadth First Search on large graphs lead, running on clusters of GPUs, to a situation of load unbalance among threads and un-coalesced memory accesses, resulting in pretty low performances. To obtain a significant improvement on a single GPU and to scale by using multiple GPUs, we resort to a suitable combination of operations to rearrange data before processing them. We propose a novel technique for mapping threads to data that achieves a perfect load balance by leveraging prefix-sum and binary search operations. To reduce the communication overhead, we perform a pruning operation on the set of edges that needs to be exchanged at each BFS level. The result is an algorithm that exploits at its best the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes.