Simple algorithms for Breadth-First Search (BFS) on large graphs, when run on clusters of GPUs, suffer from load imbalance among threads and uncoalesced memory accesses, which results in poor performance. To obtain a significant improvement on a single GPU and to scale across multiple GPUs, we rearrange the data before processing it, using a suitable combination of operations. We propose a novel technique for mapping threads to data that achieves perfect load balance by leveraging prefix-sum and binary-search operations. To reduce communication overhead, we prune the set of edges that must be exchanged at each BFS level. The result is an algorithm that makes full use of the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes.
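As an illustration of the thread-to-data mapping mentioned above, the following is a minimal sketch of how a prefix sum over frontier degrees combined with a binary search can assign exactly one edge to each logical thread. It is not the authors' implementation: the CSR layout (`row_offsets`), the function name `map_threads_to_edges`, and the sequential loop standing in for GPU threads are all assumptions made for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def map_threads_to_edges(row_offsets, frontier):
    """Assign one logical thread to each outgoing edge of the frontier.

    row_offsets: CSR row offsets (assumed layout), length = num_vertices + 1.
    frontier:    vertices to expand at the current BFS level.
    Returns a list indexed by thread id of (source_vertex, edge_index) pairs.
    """
    # Out-degree of every frontier vertex.
    degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]

    # Exclusive prefix sum: starts[i] is the first thread id owned by
    # frontier[i]; starts[-1] is the total number of edges to process.
    starts = [0] + list(accumulate(degrees))
    total_edges = starts[-1]

    mapping = []
    # On a GPU each iteration would be an independent thread; the loop
    # here only emulates that.
    for tid in range(total_edges):
        # Binary search for the last i with starts[i] <= tid.
        # Zero-degree vertices are skipped automatically.
        i = bisect_right(starts, tid) - 1
        edge_index = row_offsets[frontier[i]] + (tid - starts[i])
        mapping.append((frontier[i], edge_index))
    return mapping


# Hypothetical 4-vertex graph: vertex 0 has 3 edges, vertex 1 has none,
# vertex 2 has 1 edge, vertex 3 has 2 edges.
example_offsets = [0, 3, 3, 4, 6]
example_mapping = map_threads_to_edges(example_offsets, [0, 1, 3])
```

Because every thread handles exactly one edge, the work per thread is uniform regardless of how skewed the vertex degrees are, which is the load-balancing property the abstract refers to.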