Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand

Authors:
G. Santhanaraman;P. Balaji;K. Gopalakrishnan;R. Thakur;W. Gropp;D. K. Panda
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009



Abstract

As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large-scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, and Global Arrays) provide programming constructs that enable implicit communication through one-sided communication operations. Although MPI is the most widely used communication model for scientific computing, its one-sided communication sees limited use. This is mainly because current MPI implementations internally rely on synchronization between processes even during one-sided communication, which forfeits the potential of such constructs. In our previous work, we used the native one-sided communication primitives offered by high-speed networks such as InfiniBand (IB) to allow for true one-sided communication in MPI. In this paper, we extend that work to natively take advantage of one-sided atomic operations on cache-coherent multi-core/multi-processor architectures while still retaining the benefits of networks such as IB. Specifically, we present a hybrid design that uses locks that migrate between IB hardware atomics and multi-core CPU atomics to take advantage of both. We demonstrate the capability of the proposed design with a wide range of experiments that illustrate its performance benefits as well as its potential to avoid explicit synchronization.
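
The idea of a lock that migrates between CPU atomics and network atomics can be illustrated with a toy model. The C sketch below is not the paper's implementation; it is a minimal single-process illustration under stated assumptions. All names (`hybrid_lock`, `hybrid_try_lock`, `fake_network_cas`, the two modes) are hypothetical, the "network" compare-and-swap is simulated with a local C11 atomic rather than a real InfiniBand operation, and the mode switch is only attempted while the lock is free. The migration step as written is not safe under real concurrency; it only shows the control flow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical modes: which atomic mechanism currently serves the lock word. */
enum { MODE_CPU = 0, MODE_NETWORK = 1 };

typedef struct {
    _Atomic uint64_t word; /* 0 = free, otherwise owner rank + 1 */
    _Atomic int mode;      /* MODE_CPU or MODE_NETWORK */
} hybrid_lock;

static void hybrid_lock_init(hybrid_lock *l) {
    atomic_init(&l->word, 0);
    atomic_init(&l->mode, MODE_CPU);
}

/* Assumption: stands in for a remote compare-and-swap issued by the network
 * adapter. Here it is simulated with a local atomic and returns the old value. */
static uint64_t fake_network_cas(hybrid_lock *l, uint64_t expect, uint64_t desired) {
    uint64_t old = expect;
    atomic_compare_exchange_strong(&l->word, &old, desired);
    return old;
}

/* Try to take the lock. Same-node requesters use the CPU path; others use the
 * simulated network path. The mode migrates only while the lock is free. */
static bool hybrid_try_lock(hybrid_lock *l, int rank, bool same_node) {
    uint64_t me = (uint64_t)rank + 1;
    int need = same_node ? MODE_CPU : MODE_NETWORK;

    if (atomic_load(&l->mode) != need) {
        if (atomic_load(&l->word) != 0) {
            return false;              /* held: cannot migrate now */
        }
        atomic_store(&l->mode, need);  /* toy migration, not race-free */
    }
    if (same_node) {
        uint64_t expect = 0;
        return atomic_compare_exchange_strong(&l->word, &expect, me);
    }
    return fake_network_cas(l, 0, me) == 0;
}

static void hybrid_unlock(hybrid_lock *l) {
    atomic_store(&l->word, 0);
}

/* Small walk-through; returns the mode the lock ends in. */
static int hybrid_lock_demo(void) {
    hybrid_lock l;
    hybrid_lock_init(&l);
    assert(hybrid_try_lock(&l, 0, true));    /* local rank wins via CPU path */
    assert(!hybrid_try_lock(&l, 1, false));  /* remote rank blocked while held */
    hybrid_unlock(&l);
    assert(hybrid_try_lock(&l, 1, false));   /* lock migrates to network path */
    hybrid_unlock(&l);
    return atomic_load(&l.mode);
}
```

Calling `hybrid_lock_demo()` walks through one local acquisition, one refused remote attempt, and one migration to the simulated network path. A real design would need a protocol that keeps the two atomic domains consistent during migration, which this sketch deliberately leaves out.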