High performance RDMA-based MPI implementation over InfiniBand

Authors:
Jiuxing Liu;Jiesheng Wu;Dhabaleswar K. Panda
Affiliations:
Computer and Information Science, The Ohio State University, Columbus, OH;Computer and Information Science, The Ohio State University, Columbus, OH;Computer and Information Science, The Ohio State University, Columbus, OH
Venue:
International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Year:
2004

Abstract

Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features that help improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA not only to large messages, but also to small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation achieves a latency of 6.8 µsec for small messages and a peak bandwidth of 871 million bytes/sec. Performance evaluation shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%, compared with the original design. For large data transfers, we improve performance by reducing the time for transferring control messages. We also show that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.