Shared receive queue based scalable MPI design for InfiniBand clusters

  • Authors:
  • Sayantan Sur, Lei Chai, Hyun-Wook Jin, Dhabaleswar K. Panda

  • Affiliation:
  • Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University (all authors)

  • Venue:
  • IPDPS '06: Proceedings of the 20th International Conference on Parallel and Distributed Processing
  • Year:
  • 2006

Abstract

Clusters of several thousand nodes interconnected with InfiniBand, an emerging high-performance interconnect, have already appeared in the Top 500 list. The next-generation InfiniBand clusters are expected to be even larger, with tens of thousands of nodes. A high-performance, scalable MPI design is crucial for MPI applications to exploit the massive potential for parallelism in these very large clusters. MVAPICH is a popular implementation of MPI over InfiniBand based on its reliable connection-oriented model. This model's requirement to provision communication buffers for each connection imposes a memory scalability problem. To mitigate this issue, the latest InfiniBand standard includes a new feature called the Shared Receive Queue (SRQ), which allows communication buffers to be shared across multiple connections. In this paper, we propose a novel MPI design that efficiently utilizes SRQs and provides very good performance. Our analytical model reveals that the proposed designs require only 1/10th of the memory of the original design on a cluster of 16,000 nodes. Performance evaluation on our 8-node cluster shows that the new design provides the same performance as the existing design while requiring much less memory. In comparison to tuned existing designs, our design showed 20% and 5% improvements in the execution time of the NAS Benchmarks (Class A) LU and SP, respectively. High Performance Linpack was able to execute a much larger problem size with our new design, whereas the existing design ran out of memory.
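
To make the buffer-sharing idea concrete, the following is a minimal, illustrative libibverbs sketch, not taken from the paper or from MVAPICH, showing how one SRQ can be created and attached to two reliable-connection queue pairs so that both connections draw receive buffers from a single shared pool. The queue depths and resource counts are arbitrary assumptions for illustration only.

```c
/*
 * Hypothetical sketch: one InfiniBand Shared Receive Queue (SRQ) shared
 * by two RC queue pairs via libibverbs. Sizes below (256, 512, 128) are
 * illustrative assumptions, not values from the paper.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no InfiniBand device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* One SRQ: every QP attached to it consumes buffers from this pool. */
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 512, .max_sge = 1 }
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

    /* Two RC QPs (two connections) attached to the same SRQ, so no
     * per-connection receive buffers need to be pre-posted. */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,   /* shared receive queue */
        .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp1 = ibv_create_qp(pd, &qp_attr);
    struct ibv_qp *qp2 = ibv_create_qp(pd, &qp_attr);

    printf("SRQ shared by QPs %u and %u\n", qp1->qp_num, qp2->qp_num);

    ibv_destroy_qp(qp2);
    ibv_destroy_qp(qp1);
    ibv_destroy_srq(srq);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

Because every attached queue pair draws from the same receive pool, the number of pre-posted buffers can scale with the expected message arrival rate rather than with the number of connections, which is the memory-scaling argument summarized in the abstract.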