An effective GPU implementation of breadth-first search

Authors:
Lijuan Luo;Martin Wong;Wen-mei Hwu
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 47th Design Automation Conference
Year:
2010

Abstract

Breadth-first search (BFS) is widely used in electronic design automation (EDA) and in other fields. Researchers have tried to accelerate BFS on the GPU, but both published implementations are asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and achieves a speedup of up to 10 times.
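
For readers unfamiliar with the baseline, the following is a minimal sequential sketch of queue-based, level-by-level BFS, the kind of frontier-queue traversal whose O(|V| + |E|) cost the abstract refers to as the fastest sequential version. It is not the authors' GPU code and does not model the hierarchical queues or the three-layer kernel arrangement. The function name `bfs_levels`, the adjacency-list input format, and the use of -1 for unreachable vertices are assumptions made for illustration.

```python
def bfs_levels(adj, source):
    """Level-by-level BFS over an adjacency list (illustrative sketch).

    adj    : list of neighbor lists; adj[u] holds the vertices adjacent to u
             (hypothetical input format, not taken from the paper).
    source : index of the start vertex.
    Returns a list where entry v is the BFS level of v, or -1 if unreachable.
    """
    level = [-1] * len(adj)      # -1 marks "not yet visited"
    level[source] = 0
    frontier = [source]          # queue of vertices at the current level
    depth = 0

    while frontier:
        next_frontier = []       # queue being built for the next level
        for u in frontier:
            for v in adj[u]:
                if level[v] == -1:
                    # First visit: record the level and enqueue exactly once.
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1

    return level


if __name__ == "__main__":
    # Small undirected example graph; vertex 5 is isolated.
    graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
    print(bfs_levels(graph, 0))
```

Each vertex enters a frontier queue at most once and each adjacency list is scanned once, which is where the O(|V| + |E|) bound comes from. A traversal that instead rescans every vertex at every level does more work on graphs with many levels.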