Optimizing a conjugate gradient solver with non-blocking collective operations

Authors:
Torsten Hoefler; Peter Gottschling; Andrew Lumsdaine; Wolfgang Rehm
Affiliations:
Indiana University, Open Systems Lab, Bloomington, IN 47404, USA and Technical University of Chemnitz, Department of Computer Science, 09107 Chemnitz, Germany; Indiana University, Open Systems Lab, Bloomington, IN 47404, USA; Indiana University, Open Systems Lab, Bloomington, IN 47404, USA; Technical University of Chemnitz, Department of Computer Science, 09107 Chemnitz, Germany
Venue:
Parallel Computing
Year:
2007

References cited by this paper (15):
GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing.
CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing.
BI-CGSTAB: a fast and smoothly converging variant of BI-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing.
LogGP: incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing.
Multigrid. Multigrid.
MPI/RT --- An Emerging Standard for High-Performance Real-Time Systems. HICSS '98 Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences - Volume 3.
An Evaluation of Current High-Performance Networks. IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing.
A Framework for Collective Personalized Communication. IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing.
Send-receive considered harmful: Myths and realities of message passing. ACM Transactions on Programming Languages and Systems (TOPLAS).
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. Proceedings of the 2003 ACM/IEEE conference on Supercomputing.
Optimization of MPI collective communication on BlueGene/L systems. Proceedings of the 19th annual international conference on Supercomputing.
Automatic generation and tuning of MPI collective communication routines. Proceedings of the 19th annual international conference on Supercomputing.
Optimizing All-to-All Collective Communication by Exploiting Concurrency in Modern Networks. SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing.
Fast barrier synchronization for InfiniBand™. IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing.
A case for non-blocking collective operations. ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking.

Papers citing this paper (9):
Advanced collective communication in aspen. Proceedings of the 22nd annual international conference on Supercomputing.
Leveraging non-blocking collective communication in high-performance applications. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures.
Sparse Non-blocking Collectives in Quantum Mechanical Calculations. Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface.
Overlapping communication and computation by using a hybrid MPI/SMPSs approach. Proceedings of the 24th ACM International Conference on Supercomputing.
Design and evaluation of nonblocking collective I/O operations. EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface.
Towards autotuning by alternating communication methods. Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems.
A parallel solution of large-scale heat equation based on distributed memory hierarchy system. ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II.
Towards autotuning by alternating communication methods. ACM SIGMETRICS Performance Evaluation Review.
First observations using nonblocking collectives in MVAPICH. Proceedings of the 20th European MPI Users' Group Meeting.

Abstract

This paper presents a case study that analyzes the suitability and usage of non-blocking collective operations in parallel applications. Like their point-to-point counterparts, non-blocking collective operations make it possible to overlap communication with computation and to avoid unnecessary synchronization. These operations are provided for MPI programs by LibNBC, a portable, low-overhead implementation of non-blocking collective operations built on MPI-1. The straightforward applicability of LibNBC is demonstrated by incorporating non-blocking collective operations into a parallel conjugate gradient solver. Although only minor changes are required to use them, non-blocking collective operations allow most of the communication costs to be hidden and provide performance improvements of up to 34%. We also show that, because of overlap, there is no significant performance difference between Gigabit Ethernet and InfiniBand™ for special cases of our calculation.
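
The overlap idea in the abstract can be illustrated with a small single-process sketch. It is not the paper's code and does not use LibNBC or MPI. The helper `start_allgather` is a hypothetical stand-in for a non-blocking collective, emulated here with a background thread, and `overlapped_matvec`, `conjugate_gradient`, and the block-row partitioning across `nranks` emulated ranks are assumptions made for illustration. The sketch shows one plausible place to hide communication in a conjugate gradient solver: the matrix-vector product starts gathering the vector, computes the part that needs only local data, and waits for the gathered data only when it is required.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a non-blocking allgather. A real MPI program would
# start a non-blocking collective and receive a request handle to wait on.
_pool = ThreadPoolExecutor(max_workers=1)


def start_allgather(pieces):
    """Start concatenating all slices in the background; returns a Future."""
    return _pool.submit(lambda: [v for piece in pieces for v in piece])


def overlapped_matvec(A, x, nranks):
    """Compute y = A x with block rows split over `nranks` emulated ranks.

    The gather of x is started first, the block that needs only each rank's
    own slice is computed while the gather runs, and the remaining columns
    are processed after waiting on the handle.
    """
    n = len(x)
    bounds = [(r * n // nranks, (r + 1) * n // nranks) for r in range(nranks)]
    handle = start_allgather([x[lo:hi] for lo, hi in bounds])  # start "communication"
    y = [0.0] * n
    for lo, hi in bounds:  # local block: uses only the rank's own slice of x
        for i in range(lo, hi):
            y[i] = sum(A[i][j] * x[j] for j in range(lo, hi))
    full_x = handle.result()  # wait for the gathered vector
    for lo, hi in bounds:  # off-block columns: use the gathered data
        for i in range(lo, hi):
            y[i] += sum(A[i][j] * full_x[j] for j in range(n) if not lo <= j < hi)
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, nranks=2, tol=1e-10, max_iter=100):
    """Plain conjugate gradient for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        q = overlapped_matvec(A, p, nranks)
        alpha = rr / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    # 1D Laplacian test system; the exact solution is a vector of ones.
    A = [[2.0, -1.0, 0.0, 0.0],
         [-1.0, 2.0, -1.0, 0.0],
         [0.0, -1.0, 2.0, -1.0],
         [0.0, 0.0, -1.0, 2.0]]
    print(conjugate_gradient(A, [1.0, 0.0, 0.0, 1.0]))
```

In this emulation the background thread only concatenates lists, so it demonstrates the start, compute, wait structure rather than any real speedup. How much communication cost can actually be hidden depends on the network, the library, and how much local work is available between the start and the wait.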