Portals 3.0: Protocol Building Blocks for Low Overhead Communication
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
GASNet Specification, v1.1
Generalized portable shmem library for high performance computing
Generalized portable shmem library for high performance computing
A Multi-Platform Co-Array Fortran Compiler
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
An Evaluation of Two Implementation Strategies for Optimizing One-Sided Atomic Reduction
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
Optimizing Strided Remote Memory Access Operations on the Quadrics QsNetII Network Interconnect
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
International Journal of High Performance Computing Applications
High Performance Remote Memory Access Communication: The Armci Approach
International Journal of High Performance Computing Applications
Liquid water: obtaining the right answer for the right reasons
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Analysis of implementation options for MPI-2 one-sided
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Proceedings of the 9th conference on Computing Frontiers
Performance characterization of global address space applications: a case study with NWChem
Concurrency and Computation: Practice & Experience
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Over the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward and easy to use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the Cray XT5 2.3 PetaFLOPs computer at Oak Ridge National Laboratory. We optimized our implementation with the flow intimation technique that we have introduced in this paper. Our optimized ARMCI implementation improves scalability of both the Global Arrays (GA) programming model and a real-world chemistry application - NWChem - from small jobs up through 180,000 cores.