A case for non-blocking collective operations
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Non-blocking collective operations for MPI have been discussed for a long time. We want to contribute to this discussion, give a rationale for the use of these operations, and assess their possible benefits. A LogGP model for the CPU overhead of collective algorithms and a benchmark to measure it are provided and show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that the actual CPU overhead of non-blocking collective operations depends on the message size and the communicator size, and that they especially benefit highly scalable applications with large communicators. We prove that the overhead's share of the overall communication time of current blocking collective operations shrinks with larger communicators and larger messages. We show that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication, which leads us to the conclusion that, by using non-blocking collective communication, ideally 90% of the CPU time currently spent idle in communication can be freed for the application.
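To illustrate the overlap pattern the abstract describes, the following is a minimal sketch, not code from the paper: a process starts a non-blocking reduction, performs independent computation while the collective progresses, and only then waits for completion. It uses the MPI-3 MPI_Iallreduce interface; at the time of this paper, such operations were provided by external libraries such as libNBC rather than by the MPI standard itself.

```c
/* Sketch only: overlapping independent computation with a non-blocking
 * collective. Assumes an MPI-3 implementation (MPI_Iallreduce). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the collective; it can progress in the background. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the communication,
     * freeing CPU time that a blocking collective would spend waiting. */
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += i * 1e-9;

    /* Complete the collective before using its result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("sum = %f (overlapped work: %f)\n", global, acc);

    MPI_Finalize();
    return 0;
}
```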