Enabling highly-scalable remote memory access programming with MPI-3 one sided

Authors:
Robert Gerstenberger;Maciej Besta;Torsten Hoefler
Affiliations:
ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

Modern interconnects offer remote direct memory access (RDMA) features. Yet most applications still rely on explicit message passing for communication, despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have yet to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models in several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.