Overlapping computation with communication has long been considered a significant performance benefit for applications. Similarly, the ability of a Message Passing Interface (MPI) implementation to make independent progress (that is, to advance outstanding communication operations while the application is not inside the MPI library) is also believed to yield performance benefits. Offloading the work needed to support overlap and independent progress to an intelligent network interface is widely regarded as an ideal solution, but the benefits of this approach have not been studied in depth at the application level, in part because most MPI implementations do not sufficiently support overlap or independent progress. Recent work demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. That study was conducted on two platforms, each with two MPI implementations (one with and one without independent progress), so identical network hardware and virtually identical software stacks were compared. Furthermore, one platform, ASCI Red, allows features such as overlap and offload to be separated from one another. This paper therefore extends the previous work by further isolating the source of the performance advantage: offload, overlap, or independent progress.