Shared receive queue based scalable MPI design for InfiniBand clusters

  • Authors:
  • Sayantan Sur, Lei Chai, Hyun-Wook Jin, Dhabaleswar K. Panda

  • Affiliation:
  • Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University (all authors)

  • Venue:
  • IPDPS '06: Proceedings of the 20th International Conference on Parallel and Distributed Processing
  • Year:
  • 2006

Abstract

Clusters of several thousand nodes interconnected with InfiniBand, an emerging high-performance interconnect, have already appeared in the Top 500 list. The next-generation InfiniBand clusters are expected to be even larger, with tens of thousands of nodes. A high-performance, scalable MPI design is crucial for MPI applications to exploit the massive potential for parallelism in these very large clusters. MVAPICH is a popular implementation of MPI over InfiniBand based on its reliable connection-oriented model. This model's requirement to provision communication buffers for each connection imposes a memory scalability problem. To mitigate this issue, the latest InfiniBand standard includes a new feature called the Shared Receive Queue (SRQ), which allows communication buffers to be shared across multiple connections. In this paper, we propose a novel MPI design that efficiently utilizes SRQs and provides very good performance. Our analytical model reveals that the proposed designs require only 1/10th of the memory of the original design on a cluster of 16,000 nodes. Performance evaluation on our 8-node cluster shows that the new design provides the same performance as the existing design while requiring much less memory. In comparison to tuned existing designs, our design showed 20% and 5% improvements in the execution time of the NAS Benchmarks (Class A) LU and SP, respectively. High Performance Linpack was able to execute a much larger problem size with our new design, whereas the existing design ran out of memory.
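
To make the buffer-sharing idea concrete, the following is a minimal, illustrative libibverbs sketch, not taken from the paper or from MVAPICH, showing how one SRQ can be created and attached to two reliable-connection queue pairs so that both connections draw receive buffers from a single shared pool. The queue depths and resource counts are arbitrary assumptions for illustration only.

```c
/*
 * Hypothetical sketch: one InfiniBand Shared Receive Queue (SRQ) shared
 * by two RC queue pairs via libibverbs. Sizes below (256, 512, 128) are
 * illustrative assumptions, not values from the paper.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no InfiniBand device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* One SRQ: every QP attached to it consumes buffers from this pool. */
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 512, .max_sge = 1 }
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

    /* Two RC QPs (two connections) attached to the same SRQ, so no
     * per-connection receive buffers need to be pre-posted. */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,   /* shared receive queue */
        .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp1 = ibv_create_qp(pd, &qp_attr);
    struct ibv_qp *qp2 = ibv_create_qp(pd, &qp_attr);

    printf("SRQ shared by QPs %u and %u\n", qp1->qp_num, qp2->qp_num);

    ibv_destroy_qp(qp2);
    ibv_destroy_qp(qp1);
    ibv_destroy_srq(srq);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

Because every attached queue pair draws from the same receive pool, the number of pre-posted buffers can scale with the expected message arrival rate rather than with the number of connections, which is the memory-scaling argument summarized in the abstract.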