Modern interconnects offer remote direct memory access (RDMA) features. Yet most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have yet to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models in several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.