High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters

Authors:
Matthew J. Koop;Sayantan Sur;Qi Gao;Dhabaleswar K. Panda
Affiliations:
The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

Citing 19
Cited 8

Very high resolution simulation of compressible turbulence on the IBM-SP system

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Semicoarsening Multigrid on Distributed Memory Machines

SIAM Journal on Scientific Computing
Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
An empirical performance evaluation of scalable scientific applications

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
High performance RDMA-based MPI implementation over InfiniBand

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Impact of On-Demand Connection Management in MPI over VIA

CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
A network-failure-tolerant message-passing system for terascale clusters

International Journal of Parallel Programming
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Connection-less TCP

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

HOTI '05 Proceedings of the 13th Symposium on High Performance Interconnects
Efficient Barrier and Allreduce on Infiniband clusters using multicast and adaptive algorithms

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
On using connection-oriented vs. connection-less transport for performance and scalability of collective and one-sided operations: trade-offs and impact

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach

CCGRID '07 Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid
Infiniband scalability in open MPI

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Shared receive queue based scalable MPI design for infiniband clusters

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Adaptive connection management for scalable MPI over InfiniBand

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Can software reliability outperform hardware reliability on high performance interconnects?: a case study with MPI over infiniband

Proceedings of the 22nd annual international conference on Supercomputing
X-SRQ - Improving Scalability and Performance of Multi-core InfiniBand Clusters

Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Efficient On-Demand Connection Management Mechanisms with PGAS Models over InfiniBand

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Unifying UPC and MPI runtimes: experience with MVAPICH

Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Introducing scalable quantum approaches in language representation

QI'11 Proceedings of the 5th international conference on Quantum interaction
WMTools - assessing parallel application memory utilisation at scale

EPEW'11 Proceedings of the 8th European conference on Computer Performance Engineering
Accelerating text mining workloads in a MapReduce-based distributed GPU environment

Journal of Parallel and Distributed Computing
A study of application-level recovery methods for transient network faults

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

High-performance clusters have been growing rapidly in scale. Most of these clusters deploy a high-speed interconnect, such as Infini-Band, to achieve higher performance. Most scientific applications executing on these clusters use the Message Passing Interface (MPI) as the parallel programming model. Thus, the MPI library has a key role in achieving application performance by consuming as few resources as possible and enabling scalable performance. State-of-the-art MPI implementations over InfiniBand primarily use the Reliable Connection (RC) transport due to its good performance and attractive features. However, the RC transport requires a connection between every pair of communicating processes, with each requiring several KB of memory. As clusters continue to scale, memory requirements in RC-based implementations increase. The connection-less Unreliable Datagram (UD) transport is an attractive alternative, which eliminates the need to dedicate memory for each pair of processes. In this paper we present a high-performance UD-based MPI design. We implement our design and compare the performance and resource usage with the RC-based MVAPICH. We evaluate NPB, SMG2000, Sweep3D, and sPPM up to 4K processes on an 9216-core InfiniBand cluster. For SMG2000, our prototype shows a 60% speedup and seven-fold reduction in memory for 4K processes. Additionally, based on our model, our design has an estimated 30 times reduction in memory over MVAPICH at 16K processes when all connections are created. To the best of our knowledge, this is the first research work that presents a high-performance MPI design over InfiniBand that is completely based on UD and can achieve near identical or better application performance than RC.