Partitioned global address space (PGAS) languages combine the convenient abstraction of shared memory with the notion of affinity, extending multi-threaded programming to large-scale systems with physically distributed memory. Despite these advantages, however, PGAS languages still lack adequate tool support for performance analysis, which is one reason their adoption remains in its infancy. Some of the performance problems for which tool support is needed occur at the level of the underlying one-sided communication substrate, such as the Aggregate Remote Memory Copy Interface (ARMCI). One example is the waiting time that arises when asynchronous data transfers cannot complete without software intervention on the target side. This is not uncommon on systems with reduced operating-system kernels, such as IBM Blue Gene/P, where the use of progress threads would double the number of cores needed to run an application. In this paper, we present an extension of the Scalasca trace-analysis infrastructure that identifies and quantifies progress-related waiting times at large scale. We demonstrate its utility and scalability using a benchmark running with up to 32,768 processes.
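The progress problem described above can be illustrated with a toy model (a sketch only, not the ARMCI or Scalasca API): the origin issues an asynchronous get, but the request is serviced only when the target side executes library code, so without a dedicated progress thread the transfer stalls. The names `Target`, `remote_get`, and `progress_loop` are hypothetical, introduced purely for this illustration.

```python
import queue
import threading
import time

class Target:
    """Toy model of a passive target process: its memory is remotely
    readable, but pending requests are serviced only when poll() runs."""
    def __init__(self):
        self.memory = {"x": 42}
        self.requests = queue.Queue()

    def poll(self):
        # The "software intervention" at the target: service pending
        # one-sided requests (the work a progress thread would do).
        while not self.requests.empty():
            addr, reply = self.requests.get()
            reply.put(self.memory[addr])

def remote_get(target, addr, timeout=0.05):
    """Origin-side asynchronous get; returns None if the transfer
    cannot complete because the target makes no progress in time."""
    reply = queue.Queue()
    target.requests.put((addr, reply))
    try:
        return reply.get(timeout=timeout)
    except queue.Empty:
        return None  # origin ends up waiting on target-side progress

target = Target()

# Without target-side progress, the get stalls and times out.
result_without_progress = remote_get(target, "x")

# A dedicated progress thread polls on the target's behalf, so the
# transfer completes -- at the cost of occupying an extra core.
stop = threading.Event()
def progress_loop():
    while not stop.is_set():
        target.poll()
        time.sleep(0.001)

threading.Thread(target=progress_loop, daemon=True).start()
result_with_progress = remote_get(target, "x")
stop.set()
```

In this model, the time `remote_get` spends blocked before the progress thread starts corresponds to the progress-related waiting time the extended trace analysis is designed to identify and quantify.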