Although overlapping communication with computation is an important mechanism for achieving high performance in parallel programs, developing applications that actually achieve good overlap can be difficult. Existing approaches typically rely on manual or compiler-based transformations. This paper presents a pattern- and library-based approach to optimizing collective communication in parallel high-performance applications, using non-blocking collective operations to overlap communication with computation. Common communication and computation patterns in iterative SPMD computations motivate the transformations we present. Our approach lets the programmer optimize communication and computation separately, while automating their interaction to achieve maximum overlap. Performance results with a model application show more than a 90% reduction in communication overhead, yielding a 21% overall performance improvement.
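The paper's own pattern library is not reproduced here; as a rough illustration of the general overlap pattern the abstract describes, the sketch below uses the standard MPI-3 non-blocking collective MPI_Iallreduce in an iterative-solver-style loop: the collective is started early, independent local computation proceeds while it is in flight, and the result is awaited only when needed. The helper functions compute_interior() and compute_boundary() are hypothetical placeholders, not part of the paper.

```c
/* Minimal sketch of communication-computation overlap with a
 * non-blocking collective (MPI-3). Illustrative only; it does not
 * reflect the paper's library API. */
#include <mpi.h>

static double compute_interior(void)        { return 1.0; } /* placeholder work */
static double compute_boundary(double gres) { return gres; } /* placeholder work */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local_residual = 0.0, global_residual = 0.0;

    for (int iter = 0; iter < 100; ++iter) {
        MPI_Request req;

        /* Start the reduction early... */
        MPI_Iallreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                       MPI_SUM, MPI_COMM_WORLD, &req);

        /* ...and overlap it with computation that does not depend
           on the reduction result. */
        double interior = compute_interior();

        /* Complete the collective only when its result is needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        local_residual = interior + compute_boundary(global_residual);
    }

    MPI_Finalize();
    return 0;
}
```

In this pattern, the amount of overlap achieved depends on how much independent work is available between the initiation call and the wait, which is precisely the interaction the abstract says the library automates.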