RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

  • Authors:
  • Sayantan Sur; Hyun-Wook Jin; Lei Chai; Dhabaleswar K. Panda

  • Affiliations:
  • The Ohio State University (all authors)

  • Venue:
  • Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '06)
  • Year:
  • 2006


Abstract

Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use a rendezvous protocol for efficient transfer of large messages. This protocol can be designed using either RDMA Write or RDMA Read; usually, it is implemented using RDMA Write. The RDMA Write-based protocol requires a two-way handshake between the sending and receiving processes. At the same time, to achieve low latency, MPI implementations often provide a polling-based progress engine. The two-way handshake requires the polling progress engine to discover multiple control messages, which in turn forces MPI applications to call into the MPI library to make progress. For compute- or I/O-intensive applications, this is not possible, so most communication progress is made only after the computation or I/O is over. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt-based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that nearly complete computation/communication overlap can be achieved with our RDMA Read with Interrupt based protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, on a 32-node InfiniBand cluster. We observe that the gains in MPI_Wait time increase with system size, indicating that our designs have a strong positive impact on the scalability of parallel applications.
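
To make the protocol difference concrete, the sketch below is a minimal libibverbs illustration (not the implementation evaluated in the paper) of an RDMA Read based rendezvous exchange: the sender advertises its registered buffer in a single RTS control message, and the receiver pulls the payload itself with one RDMA Read work request, so no CTS has to be discovered by the polling progress engine. The rndv_rts layout and the send_ctrl() helper are hypothetical, and queue-pair setup, memory registration, and the final FIN message are assumed to be handled elsewhere.

```c
/* Sketch of an RDMA Read based rendezvous exchange. Assumes the queue
 * pair, memory regions, and a control-message channel already exist;
 * send_ctrl() and struct rndv_rts are hypothetical helpers. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Control message advertising the sender's registered buffer. */
struct rndv_rts {
    uint64_t addr;   /* virtual address of the registered send buffer */
    uint32_t rkey;   /* remote key from ibv_reg_mr() on the sender    */
    uint32_t len;    /* message length in bytes                       */
};

/* Sender side: advertise the buffer once, then return to computation.
 * No CTS from the receiver needs to be polled for. */
void rndv_sender(struct ibv_mr *mr, uint32_t len,
                 void (*send_ctrl)(const void *, size_t))
{
    struct rndv_rts rts = {
        .addr = (uintptr_t)mr->addr,
        .rkey = mr->rkey,
        .len  = len,
    };
    send_ctrl(&rts, sizeof rts);   /* RTS: the only control message sent */
    /* Sender buffer is reusable only after the receiver's FIN arrives. */
}

/* Receiver side: pull the payload with a single RDMA Read work request. */
int rndv_receiver(struct ibv_qp *qp, struct ibv_mr *recv_mr,
                  const struct rndv_rts *rts)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)recv_mr->addr,
        .length = rts->len,
        .lkey   = recv_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,  /* generate a completion entry */
    };
    wr.wr.rdma.remote_addr = rts->addr;
    wr.wr.rdma.rkey        = rts->rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
    /* After the read completes, the receiver sends a FIN so the sender
     * can reuse its buffer. */
}
```

Because the read is posted signaled, its completion can also be delivered as a completion-channel event rather than discovered by polling, which is the kind of hook an interrupt-based asynchronous progress scheme can build on.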