In this paper, we present a new scheme, Send Gather Receive Scatter (SGRS), to perform zero-copy datatype communication over InfiniBand. The scheme leverages the gather/scatter feature provided by InfiniBand channel semantics: the Send Gather and Receive Scatter operations can process non-contiguous data on both the send and the receive side. We have implemented this design and evaluated its performance with Message Passing Interface (MPI) level point-to-point microbenchmarks and collectives, on PCI-X and on upcoming high-performance PCI-Express systems. In our previous work, we proposed an alternative zero-copy approach that uses multiple RDMA Writes (Multi-W). Compared to the existing Multi-W zero-copy datatype scheme, the SGRS scheme overcomes the drawbacks of low network utilization and high startup cost. On PCI-X platforms, our experimental results show significant improvement in both point-to-point and collective datatype communication. Compared with the Multi-W scheme, the latency of a vector datatype is reduced by up to 62% and the bandwidth improves by up to 400%. The Alltoall collective shows up to a 23% reduction in latency. Further, the SGRS scheme incurs low CPU overhead, which suggests potential for better overlap of computation and communication. The experimental results on PCI-Express platforms demonstrate the relevance of zero-copy protocols for overcoming memory bandwidth limitations. The trends we observe on PCI-X platforms are magnified on PCI-Express platforms, with even higher improvement for the microbenchmarks and collectives.
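To make the gather/scatter idea concrete, the following is a minimal sketch in Python, not the implementation evaluated in the paper. It models a vector datatype as a list of (offset, length) segments, gathers the sender's non-contiguous blocks into one message, and scatters that message into a differently laid out receive buffer. The function names (`vector_segments`, `gather`, `scatter`, `demo`) and the layouts used are hypothetical, chosen only for illustration. In the scheme described above this work is performed by the network adapter without intermediate copies; the byte copies here only simulate the data movement.

```python
from typing import List, Tuple

# One contiguous piece of a non-contiguous layout: (byte offset, byte length).
Segment = Tuple[int, int]


def vector_segments(count: int, blocklen: int, stride: int,
                    elem_size: int = 1) -> List[Segment]:
    """Flatten a vector layout into segments.

    The layout has `count` blocks of `blocklen` elements, with block starts
    `stride` elements apart (the same parameters an MPI vector datatype takes).
    """
    return [(i * stride * elem_size, blocklen * elem_size)
            for i in range(count)]


def gather(buf: bytes, segments: List[Segment]) -> bytes:
    """Model of Send Gather: concatenate the listed segments into one message."""
    return b"".join(buf[off:off + n] for off, n in segments)


def scatter(wire: bytes, dst: bytearray, segments: List[Segment]) -> None:
    """Model of Receive Scatter: place one message into the listed segments."""
    pos = 0
    for off, n in segments:
        dst[off:off + n] = wire[pos:pos + n]
        pos += n


def demo() -> bytearray:
    """Send 4 blocks of 2 bytes; receive them as 2 blocks of 4 bytes."""
    src = bytes(range(16))
    send_segs = vector_segments(count=4, blocklen=2, stride=4)  # assumed layout
    recv_segs = vector_segments(count=2, blocklen=4, stride=8)  # assumed layout
    dst = bytearray(16)
    scatter(gather(src, send_segs), dst, recv_segs)
    return dst
```

Calling `demo()` moves the eight selected source bytes in a single modeled message even though the send and receive layouts differ. A Multi-W style transfer would instead issue one write per contiguous block, which is the startup cost the abstract refers to.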