Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation

  • Authors:
  • Jiuxing Liu, Abhinav Vishnu, Dhabaleswar K. Panda

  • Affiliations:
  • Ohio State University (all authors)

  • Venue:
  • Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04)
  • Year:
  • 2004

Abstract

In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications. In this paper, we study how to overcome this bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all of them. We also discuss important design issues and provide an in-depth discussion of different policies for using multirail networks, including an adaptive striping scheme that can dynamically change the striping parameters based on current system conditions. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time, depending on the communication pattern. We have also shown that the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail.
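To make the striping ideas in the abstract concrete, the following is a minimal C sketch of weighted message striping with feedback-driven weight adjustment. It is not the paper's actual implementation: the rail count, the rail_weight array, the post_send_on_rail stub, the smoothing factor alpha, and the update rule are all illustrative assumptions standing in for the real MPI/InfiniBand internals.

    #include <stdio.h>
    #include <stddef.h>

    #define NUM_RAILS 2

    /* Relative striping weights, one per rail (assumed representation). */
    static double rail_weight[NUM_RAILS] = { 1.0, 1.0 };

    /* Stub: a real implementation would post an RDMA operation on the
     * given rail here. */
    static void post_send_on_rail(int rail, const char *buf, size_t len)
    {
        (void)buf;
        printf("rail %d: %zu bytes\n", rail, len);
    }

    /* Stripe a large message across the rails in proportion to the
     * current weights. */
    static void stripe_send(const char *buf, size_t len)
    {
        double total = 0.0;
        for (int r = 0; r < NUM_RAILS; r++)
            total += rail_weight[r];

        size_t offset = 0;
        for (int r = 0; r < NUM_RAILS; r++) {
            /* The last rail takes the remainder to avoid rounding loss. */
            size_t chunk = (r == NUM_RAILS - 1)
                ? len - offset
                : (size_t)(len * rail_weight[r] / total);
            if (chunk > 0)
                post_send_on_rail(r, buf + offset, chunk);
            offset += chunk;
        }
    }

    /* Adaptive feedback: after a striped send completes, nudge each
     * weight toward that rail's observed throughput, so faster rails
     * receive larger stripes next time (hypothetical update rule). */
    static void update_weights(const double observed_bw[NUM_RAILS])
    {
        const double alpha = 0.25;  /* smoothing factor, assumed */
        for (int r = 0; r < NUM_RAILS; r++)
            rail_weight[r] = (1.0 - alpha) * rail_weight[r]
                           + alpha * observed_bw[r];
    }

    int main(void)
    {
        static char msg[1 << 20];                  /* 1 MB dummy message */
        stripe_send(msg, sizeof msg);
        double bw[NUM_RAILS] = { 900.0, 700.0 };   /* hypothetical MB/s */
        update_weights(bw);
        stripe_send(msg, sizeof msg);              /* stripes now skewed */
        return 0;
    }

The point of the sketch is the feedback loop: stripe sizes follow the weights, and the weights drift toward each rail's measured bandwidth, which is how an adaptive scheme can perform well without a priori knowledge of per-rail bandwidth, as the abstract claims.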