A scalable tool architecture for diagnosing wait states in massively parallel applications

Authors:
Markus Geimer;Felix Wolf;Brian J. N. Wylie;Bernd Mohr
Affiliations:
Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany;Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany and Department of Computer Science, RWTH Aachen University, 52056 Aachen, Germany;Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany;Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany
Venue:
Parallel Computing
Year:
2009

Abstract

When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales.
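To make the idea of searching traces for wait-state patterns concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes one particular pattern, commonly called "late sender" (a receive is entered before the matching send starts, so the receiver idles), and it imitates the replay sequentially: each send event forwards its enter timestamp along the same channel the original message used, and each receive event compares that timestamp with its own. The `Event` record, the `late_sender_waits` function, and the trace layout are all hypothetical names chosen for illustration; messages on a channel are assumed to be matched in FIFO order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str     # "send" or "recv" (hypothetical event types)
    peer: int     # rank of the communication partner
    enter: float  # time at which the process entered the call


def late_sender_waits(traces):
    """Sum, per rank, the time receives spent waiting for late sends.

    `traces` maps each rank to its local, time-ordered list of events.
    """
    # Step 1: every sender forwards the enter time of each send along
    # the (sender, receiver) channel, mimicking the replayed message.
    channels = defaultdict(deque)
    for rank, events in traces.items():
        for ev in events:
            if ev.kind == "send":
                channels[(rank, ev.peer)].append(ev.enter)

    # Step 2: every receiver pops the matching send time (FIFO per
    # channel, an assumption) and compares it with its own enter time.
    waits = {rank: 0.0 for rank in traces}
    for rank, events in traces.items():
        for ev in events:
            if ev.kind == "recv":
                send_enter = channels[(ev.peer, rank)].popleft()
                # Waiting time exists only if the send started later.
                waits[rank] += max(0.0, send_enter - ev.enter)
    return waits


# Two ranks: rank 1 posts its receive at t=2.0, but rank 0 only starts
# sending at t=5.0, so rank 1 waits. The reply is not late.
example_traces = {
    0: [Event("send", 1, 5.0), Event("recv", 1, 7.0)],
    1: [Event("recv", 0, 2.0), Event("send", 0, 6.0)],
}
print(late_sender_waits(example_traces))
```

In this sketch each receiver needs only its own events plus the timestamps delivered over its channels, which is the property that lets the comparison run in parallel on as many processes as the original application used, rather than in one sequential pass over all traces.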