Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use a rendezvous protocol for the efficient transfer of large messages. This protocol can be designed using either RDMA Write or RDMA Read; in practice it is usually implemented with RDMA Write, which requires a two-way handshake between the sending and receiving processes. To achieve low latency, MPI implementations often provide a polling-based progress engine, and the two-way handshake requires this engine to discover multiple control messages. This places a restriction on MPI applications: they must call into the MPI library to make progress. Compute- or I/O-intensive applications cannot do so, and most communication progress is therefore made only after the computation or I/O completes. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt-based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that nearly complete computation/communication overlap is achievable with our RDMA Read with Interrupt based protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Our application evaluation with Linpack (HPL) and NAS-SP (Class C) shows that MPI_Wait time is reduced by around 30% and 28%, respectively, on a 32-node InfiniBand cluster. The gains in MPI_Wait time grow as the system size increases, indicating that our designs have a strong positive impact on the scalability of parallel applications.