Partitioned global address space (PGAS) languages combine the convenient abstraction of shared memory with the notion of affinity, extending multi-threaded programming to large-scale systems with physically distributed memory. Despite these advantages, however, PGAS languages still lack adequate tool support for performance analysis, which is one reason their adoption remains in its infancy. Some of the performance problems for which tool support is needed occur at the level of the underlying one-sided communication substrate, such as the Aggregate Remote Memory Copy Interface (ARMCI). One example is the waiting time that arises when asynchronous data transfers cannot complete without software intervention on the target side. This is not uncommon on systems with reduced operating-system kernels, such as IBM Blue Gene/P, where the use of progress threads would double the number of cores needed to run an application. In this paper, we present an extension of the Scalasca trace-analysis infrastructure that identifies and quantifies progress-related waiting times at large scale. We demonstrate its utility and scalability using a benchmark running with up to 32,768 processes.
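The progress problem described above can be illustrated with a toy model (a sketch only, not the ARMCI or Scalasca API): the origin issues an asynchronous get, but the request is serviced only when the target side executes library code, so without a dedicated progress thread the transfer stalls. The names `Target`, `remote_get`, and `progress_loop` are hypothetical, introduced purely for this illustration.

```python
import queue
import threading
import time

class Target:
    """Toy model of a passive target process: its memory is remotely
    readable, but pending requests are serviced only when poll() runs."""
    def __init__(self):
        self.memory = {"x": 42}
        self.requests = queue.Queue()

    def poll(self):
        # The "software intervention" at the target: service pending
        # one-sided requests (the work a progress thread would do).
        while not self.requests.empty():
            addr, reply = self.requests.get()
            reply.put(self.memory[addr])

def remote_get(target, addr, timeout=0.05):
    """Origin-side asynchronous get; returns None if the transfer
    cannot complete because the target makes no progress in time."""
    reply = queue.Queue()
    target.requests.put((addr, reply))
    try:
        return reply.get(timeout=timeout)
    except queue.Empty:
        return None  # origin ends up waiting on target-side progress

target = Target()

# Without target-side progress, the get stalls and times out.
result_without_progress = remote_get(target, "x")

# A dedicated progress thread polls on the target's behalf, so the
# transfer completes -- at the cost of occupying an extra core.
stop = threading.Event()
def progress_loop():
    while not stop.is_set():
        target.poll()
        time.sleep(0.001)

threading.Thread(target=progress_loop, daemon=True).start()
result_with_progress = remote_get(target, "x")
stop.set()
```

In this model, the time `remote_get` spends blocked before the progress thread starts corresponds to the progress-related waiting time the extended trace analysis is designed to identify and quantify.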