A case for non-blocking collective operations
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Non-blocking collective operations for MPI have been discussed for a long time. We want to contribute to this discussion, give a rationale for the use of these operations, and assess their possible benefits. A LogGP model for the CPU overhead of collective algorithms and a benchmark to measure it are provided and show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that the actual CPU overhead of non-blocking collective operations depends on the message size and the communicator size, and that they especially benefit highly scalable applications with large communicators. We prove that the overhead's share of the overall communication time of current blocking collective operations shrinks with larger communicators and larger messages. We show that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication, which leads us to the conclusion that, by using non-blocking collective communication, ideally 90% of the CPU time currently spent idle in communication can be freed for the application.
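To illustrate the overlap pattern the abstract describes, the following is a minimal sketch, not code from the paper: a process starts a non-blocking reduction, performs independent computation while the collective progresses, and only then waits for completion. It uses the MPI-3 MPI_Iallreduce interface; at the time of this paper, such operations were provided by external libraries such as libNBC rather than by the MPI standard itself.

```c
/* Sketch only: overlapping independent computation with a non-blocking
 * collective. Assumes an MPI-3 implementation (MPI_Iallreduce). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the collective; it can progress in the background. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the communication,
     * freeing CPU time that a blocking collective would spend waiting. */
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += i * 1e-9;

    /* Complete the collective before using its result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("sum = %f (overlapped work: %f)\n", global, acc);

    MPI_Finalize();
    return 0;
}
```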