High Performance Remote Memory Access Communication: The ARMCI Approach

Authors:
J. Nieplocha;V. Tipparaju;M. Krishnan;D. K. Panda
Affiliations:
Computational Sciences and Mathematics Department, Pacific Northwest National Laboratory, Richland, WA 99352; -; Computational Sciences and Mathematics Department, Pacific Northwest National Laboratory, Richland, WA 99352; Ohio State University
Venue:
International Journal of High Performance Computing Applications
Year:
2006

Abstract

This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable, high-performance remote memory access communication interface. ARMCI was originally developed under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project, and it is currently used and advanced as part of the run-time layer of the DOE project Programming Models for Scalable Parallel Computing. The paper discusses the model, addresses the challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on the latency hiding mechanisms and the ability to optimize noncontiguous data transfers.