Breaking the speed and scalability barriers for graph exploration on distributed-memory machines

Authors:
Fabio Checconi;Fabrizio Petrini;Jeremiah Willcock;Andrew Lumsdaine;Anamitra Roy Choudhury;Yogish Sabharwal
Affiliations:
IBM TJ Watson, Yorktown Heights, NY;IBM TJ Watson, Yorktown Heights, NY;CREST, Indiana University, Bloomington, IN;CREST, Indiana University, Bloomington, IN;IBM India Research, New Delhi, DL, India;IBM India Research, New Delhi, DL, India
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

In this paper, we describe the challenges involved in designing a family of highly efficient Breadth-First Search (BFS) algorithms and in optimizing these algorithms on the latest two generations of Blue Gene machines, Blue Gene/P and Blue Gene/Q. With our recent winning Graph 500 submissions in November 2010, June 2011, and November 2011, we have achieved unprecedented scalability results in both space and size. On Blue Gene/P, we have been able to parallelize a scale 38 problem with 2^38 vertices and 2^42 edges on 131,072 processing cores. Using only four racks of an experimental configuration of Blue Gene/Q, we have achieved a processing rate of 254 billion edges per second on 65,536 processing cores. This paper describes the algorithmic design and the main classes of optimizations that we have used to achieve these results.
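
As a rough check on the figures quoted in the abstract, the sketch below works through the arithmetic. It assumes the Graph 500 default edge factor of 16, which the abstract does not state; the function names are hypothetical and chosen only for illustration. Under that assumption, a scale 38 problem has 2^38 vertices and 16 x 2^38 = 2^42 edges, and 254 billion edges per second spread over 65,536 cores averages a little under 4 million edges per second per core.

```python
# Assumption: Graph 500 default edge factor (not stated in the abstract).
EDGE_FACTOR = 16


def graph500_size(scale, edge_factor=EDGE_FACTOR):
    """Return (vertices, edges) for a Graph 500 problem of the given scale."""
    vertices = 2 ** scale
    edges = edge_factor * vertices
    return vertices, edges


def per_core_rate(edges_per_second, cores):
    """Average traversal rate per core, in edges per second."""
    return edges_per_second / cores


if __name__ == "__main__":
    vertices, edges = graph500_size(38)
    print(f"scale 38: {vertices} vertices, {edges} edges")
    print(f"edges == 2**42: {edges == 2 ** 42}")

    # Blue Gene/Q figure from the abstract: 254 billion edges/s on 65,536 cores.
    rate = per_core_rate(254e9, 65_536)
    print(f"per-core rate: {rate:,.0f} edges/s")
```

The per-core number is only an average derived from the two totals in the abstract; it says nothing about how the work was actually distributed across cores.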