Fast PGAS Implementation of Distributed Graph Algorithms

Authors:
Guojing Cong;George Almasi;Vijay Saraswat
Affiliations:
-;-;-
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Citing 14
Cited 8

A bridging model for parallel computation

Communications of the ACM
Direct bulk-synchronous parallel algorithms

Journal of Parallel and Distributed Computing
Communication optimizations for irregular scientific computations on distributed memory architectures

Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
Communication-efficient parallel sorting (preliminary version)

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Cilk: an efficient multithreaded runtime system

Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)

SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)
Communication-optimal parallel minimum spanning tree algorithms (extended abstract)

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
External-memory graph algorithms

Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms
On power-law relationships of the Internet topology

Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Designing Practical Efficient Algorithms for Symmetric Multiprocessors

ALENEX '99 Selected papers from the International Workshop on Algorithm Engineering and Experimentation
On the Architectural Requirements for Efficient Execution of Graph Algorithms

ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
X10: an object-oriented approach to non-uniform cluster computing

OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Designing irregular parallel algorithms with mutual exclusion and lock-free protocols

Journal of Parallel and Distributed Computing

Parallel breadth-first search on distributed memory systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing the Barnes-Hut algorithm in UPC

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Shared work list: hacking amorphous data parallelism in UPC

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Introducing ScaleGraph: an X10 library for billion scale graph analytics

Proceedings of the 2012 ACM SIGPLAN X10 Workshop
Breaking the speed and scalability barriers for graph exploration on distributed-memory machines

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Fast and memory-efficient minimum spanning tree on the GPU

International Journal of Computational Science and Engineering
Massive data analytics: the graph 500 on IBM Blue Gene/Q

IBM Journal of Research and Development
PAGE: a partition aware graph computation engine

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Due to the memory intensive workload and the erratic access pattern, irregular graph algorithms are notoriously hard to implement and optimize for high performance on distributed-memory systems. Although the PGAS paradigm proposed recently improves ease of programming, no high performance PGAS implementation of large-scale graph analysis is known. We present the first fast PGAS implementation of graph algorithms for the connected components and minimum spanning tree problems. By improving memory access locality, compared with the naive implementation, our implementation exhibits much better communication efficiency and cache performance on a cluster of SMPs. With additional algorithmic and PGASspecific optimizations, our implementation achieves significant speedups over both the best sequential implementation and the best single-node SMP implementation for large, sparse graphs with more than a billion edges.