In this paper, we describe the challenges involved in designing a family of highly efficient Breadth-First Search (BFS) algorithms and in optimizing these algorithms on the latest two generations of Blue Gene machines, Blue Gene/P and Blue Gene/Q. With our recent winning Graph 500 submissions in November 2010, June 2011, and November 2011, we have achieved unprecedented scalability results in both space and size. On Blue Gene/P, we have been able to parallelize a scale 38 problem with 2^38 vertices and 2^42 edges on 131,072 processing cores. Using only four racks of an experimental configuration of Blue Gene/Q, we have achieved a processing rate of 254 billion edges per second on 65,536 processing cores. This paper describes the algorithmic design and the main classes of optimizations that we have used to achieve these results.
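
As a rough check on the figures quoted above, the following sketch relates a Graph 500 problem scale to its vertex and edge counts and converts an aggregate traversal rate into a per-core rate. It assumes the Graph 500 default edge factor of 16, which the abstract does not state explicitly but which is consistent with 2^38 vertices and 2^42 edges. The function names are hypothetical and chosen only for illustration.

```python
def graph500_size(scale, edge_factor=16):
    """Return (vertices, edges) for a Graph 500 problem of the given scale.

    Assumption: edge_factor=16 is the benchmark default; it is not
    stated in the abstract itself.
    """
    vertices = 1 << scale            # 2^scale vertices
    edges = edge_factor * vertices   # edge_factor edges per vertex
    return vertices, edges


def per_core_rate(edges_per_second, cores):
    """Average traversal rate per core, in edges per second."""
    return edges_per_second / cores


if __name__ == "__main__":
    vertices, edges = graph500_size(38)
    print(vertices == 2**38, edges == 2**42)

    # Blue Gene/Q figure from the abstract: 254 billion edges per second
    # on 65,536 cores.
    rate = per_core_rate(254e9, 65_536)
    print(f"{rate:.3e} edges per second per core")
```

Under these assumptions the Blue Gene/Q result corresponds to a few million traversed edges per second on each core, which gives a sense of how much of the aggregate rate comes from scale rather than single-core speed.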