The communication subsystem plays a pivotal role in achieving scalable performance in clusters. The communication semantics employed are dictated by the programming model used by the application, such as MPI or UPC. Among the range of communication primitives, collective and one-sided operations are especially significant and must be designed to harness the capabilities and features exposed by the underlying network. In some cases, there is a direct match between the semantics of the operations and the underlying network primitives. InfiniBand provides two transport modes: (i) connection-oriented Reliable Connection (RC), supporting Memory and Channel semantics, and (ii) connectionless Unreliable Datagram (UD), supporting Channel semantics. Achieving good performance and scalability requires careful analysis and design of communication primitives based on these options. In this paper, we evaluate the scalability and performance trade-offs between the RC and UD transport modes. We study the semantic advantages of mapping collective and one-sided operations onto the memory and channel semantics of InfiniBand (IBA). We take AlltoAll as a case study to demonstrate the benefits of RDMA over Send/Recv and to show the performance/memory trade-offs across IB transports. Our experimental results show that UD-based AlltoAll performs 38% better than Bruck's algorithm for short messages and up to two times better than the direct AlltoAll over RC. Since InfiniBand does not provide RDMA over UD in hardware, we emulate it in our study. Our results show a performance degradation of up to a factor of three in emulated RDMA Read latency compared to RC, highlighting the need for hardware-based RDMA operations over UD.
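To make the two AlltoAll baselines named above concrete, the following is a minimal, single-process sketch that simulates a direct exchange and a radix-2 Bruck-style exchange on in-memory lists. It is an illustration under stated assumptions, not the implementation evaluated in the paper: no InfiniBand verbs, RC/UD transports, or MPI calls are involved, and the function names `direct_alltoall` and `bruck_alltoall` are hypothetical. It only shows why the Bruck-style scheme needs about log2(P) communication steps where the direct scheme needs P - 1 per process.

```python
def direct_alltoall(send):
    """Direct exchange: process dst ends up with send[src][dst] from every src.

    On a real network each process would issue P - 1 separate messages.
    """
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]


def bruck_alltoall(send):
    """Simulated radix-2 Bruck-style all-to-all.

    Returns (recv, steps), where steps is the number of communication rounds.
    """
    p = len(send)
    # Phase 1: local rotation, so slot j on process i holds the block
    # destined for process (i + j) mod p.
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]

    # Phase 2: in the round with distance k, process i forwards every slot j
    # whose bit k is set to process (i + k) mod p, into the same slot.
    steps = 0
    k = 1
    while k < p:
        nxt = [row[:] for row in buf]
        for i in range(p):
            dst = (i + k) % p
            for j in range(p):
                if j & k:
                    nxt[dst][j] = buf[i][j]
        buf = nxt
        k <<= 1
        steps += 1

    # Phase 3: local inverse rotation. Slot j on process i now holds the
    # block that originated at process (i - j) mod p.
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            recv[i][(i - j) % p] = buf[i][j]
    return recv, steps


if __name__ == "__main__":
    for p in (2, 4, 8, 16):
        # Label each block with its (source, destination) pair.
        send = [[(s, d) for d in range(p)] for s in range(p)]
        recv, steps = bruck_alltoall(send)
        assert recv == direct_alltoall(send)
        print(f"P={p}: direct uses {p - 1} messages per process, "
              f"Bruck-style uses {steps} rounds")
```

The Bruck-style variant trades fewer, larger messages for extra local copies and forwarding through intermediate processes, which is why it is typically considered for short messages, the regime in which the abstract compares it against the UD-based AlltoAll.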