Understanding parallelism in graph traversal on multi-core clusters

Authors:
Huiwei Lv;Guangming Tan;Mingyu Chen;Ninghui Sun
Affiliations:
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate School of Chinese Academy of Sciences, Beijing, China;State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Venue:
Computer Science - Research and Development
Year:
2013

Citing 17
Cited 1

MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Terascale spectral element dynamical core for atmospheric general circulation models

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs

Journal of Parallel and Distributed Computing
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Efficient Breadth-First Search on the Cell/BE Processor

IEEE Transactions on Parallel and Distributed Systems
An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Early experiences with large-scale Cray XMT systems

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Scalable Graph Exploration on Multicore Processors

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Analysis and performance results of computing betweenness centrality on IBM Cyclops64

The Journal of Supercomputing
Parallel breadth-first search on distributed memory systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
The Combinatorial BLAS: design, implementation, and applications

International Journal of High Performance Computing Applications
Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication

PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases

Efficient breadth first search on multi-GPU systems

Journal of Parallel and Distributed Computing

Abstract

There is an ever-increasing need for exploring large-scale graph data sets in computational sciences, social networks, and business analytics. However, due to their irregular and memory-intensive nature, graph applications are notorious for their poor performance on parallel computer systems. In this paper we propose a new hybrid MPI/Pthreads breadth-first search (BFS) algorithm featuring (i) overlapping computation and communication by separating them into multiple threads, (ii) maximizing multi-threading parallelism on multi-cores with massive threads to improve throughput, and (iii) exploiting pipeline parallelism using lock-free queues for asynchronous communication. By comparing it with a traditional MPI-only BFS algorithm, we learned several valuable lessons that help to understand and exploit parallelism in graph traversal applications. Experiments show our algorithm is 1.9× faster than the MPI-only version, capable of processing 1.45 billion edges per second on a 32-node SMP cluster. At a large scale, our algorithm is 1.49× faster than the MPI-only BFS algorithm in the Combinatorial BLAS Library with 6,144 cores.
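
For readers unfamiliar with the baseline the abstract compares against, the following is a minimal single-process sketch of level-synchronous BFS over a vertex-partitioned graph, the general pattern that distributed BFS implementations build on. It is not the authors' algorithm: the function name `partitioned_bfs`, the modulo ownership rule, and the simulated "ranks" are assumptions made for illustration, and no MPI, Pthreads, or lock-free queues are involved. It only shows the alternation of a computation phase (scanning the frontier) and a communication phase (exchanging discovered vertices) that the hybrid design aims to overlap.

```python
def partitioned_bfs(adj, source, num_ranks):
    """Simulate level-synchronous BFS with vertices split across ranks.

    adj: dict mapping a vertex to an iterable of its neighbours.
    Ownership rule (an assumption for this sketch): owner(v) = v % num_ranks.
    Returns (level, edges_scanned), where level maps each reached vertex
    to its BFS depth.
    """
    def owner(v):
        return v % num_ranks

    # A single shared dict stands in for the per-rank visited arrays that a
    # real distributed implementation would keep.
    level = {source: 0}
    frontier = [[] for _ in range(num_ranks)]
    frontier[owner(source)].append(source)
    depth = 0
    edges_scanned = 0

    while any(frontier):
        # Computation phase: every rank scans its part of the frontier and
        # buffers each neighbour in the outbox of the rank that owns it.
        outbox = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for u in frontier[rank]:
                for v in adj.get(u, ()):
                    edges_scanned += 1
                    outbox[rank][owner(v)].append(v)

        # Communication phase: an all-to-all exchange, simulated by reading
        # the outboxes. Each owner marks vertices it has not seen before.
        depth += 1
        next_frontier = [[] for _ in range(num_ranks)]
        for dst in range(num_ranks):
            for src in range(num_ranks):
                for v in outbox[src][dst]:
                    if v not in level:
                        level[v] = depth
                        next_frontier[dst].append(v)
        frontier = next_frontier

    return level, edges_scanned
```

Dividing `edges_scanned` by the elapsed wall-clock time gives a traversed-edges-per-second figure, the kind of rate the abstract reports. In this sketch the two phases run strictly one after the other within each level; the abstract's design instead places them in separate threads so they can proceed concurrently.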