Zero-copy protocol for MPI using InfiniBand unreliable datagram

  • Authors:
  • Matthew J. Koop; Sayantan Sur; Dhabaleswar K. Panda

  • Affiliations:
  • Network-Based Computing Laboratory, The Ohio State University, 2015 Neil Ave., Columbus, OH 43210, USA (all authors)

  • Venue:
  • CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
  • Year:
  • 2007


Abstract

Memory copies are widely regarded as detrimental to the overall performance of applications. High-performance systems make every effort to reduce the number of memory copies, especially the copies incurred during message passing. State-of-the-art implementations of message-passing libraries, such as MPI, utilize user-level networking protocols to reduce or eliminate memory copies. InfiniBand is an emerging user-level networking technology that is gaining rapid acceptance in several domains, including HPC. To eliminate message copies while transferring large messages, MPI libraries over InfiniBand employ “zero-copy” protocols that use Remote Direct Memory Access (RDMA). RDMA is available only in the connection-oriented transports of InfiniBand, such as Reliable Connection (RC). However, the Unreliable Datagram (UD) transport of InfiniBand has been shown to scale much better than the RC transport with respect to memory usage. In an optimal design, it should be possible to perform zero-copy message transfers over scalable transports such as UD. In this paper, we present our design of a novel zero-copy protocol that is based directly on the scalable UD transport. Thus, our protocol achieves the twin objectives of scalability and good performance. Our analysis shows that uni-directional messaging bandwidth can be within 9% of what is achievable over RC for messages of 64KB and above. Application benchmark evaluation shows that our design delivers a 21% speedup for the in.rhodo dataset for LAMMPS over a copy-based approach, giving performance within 1% of RC.
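The copy-based versus zero-copy distinction at the heart of the abstract can be illustrated with a small sketch. This is a conceptual Python simulation only, not InfiniBand verbs code; the function names and the way copies are counted are hypothetical, and a real implementation operates on registered memory regions and hardware receive queues:

```python
# Conceptual simulation of copy-based vs. zero-copy message delivery.
# Hypothetical names; real MPI-over-InfiniBand code posts registered
# buffers to the NIC rather than calling Python functions.

def copy_based_recv(wire_data: bytes, user_buf: bytearray) -> int:
    """Copy-based path: the NIC deposits data into a library-owned
    bounce buffer, and the library then copies it into the user's buffer.
    Returns the number of extra copies beyond the NIC's own write."""
    bounce = bytearray(wire_data)      # models the NIC write into a bounce buffer
    user_buf[:len(bounce)] = bounce    # the extra memcpy a zero-copy protocol avoids
    return 1

def zero_copy_recv(wire_data: bytes, user_buf: bytearray) -> int:
    """Zero-copy path: the buffer posted for the receive *is* the user's
    buffer, so the NIC write is the only data movement."""
    user_buf[:len(wire_data)] = wire_data  # models the NIC write landing in place
    return 0

# A "large" message (~64KB), the regime where the paper reports
# zero-copy over UD coming within 9% of RC bandwidth.
msg = b"payload!" * 8192
dst = bytearray(len(msg))
extra = zero_copy_recv(msg, dst)
print(extra, bytes(dst) == msg)
```

The point of the sketch is only that the copy-based path pays one additional memory copy per large message; the paper's contribution is obtaining the zero-copy behavior over the scalable UD transport, where RDMA is unavailable.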