Verifying Causality between Distant Performance Phenomena in Large-Scale MPI Applications

Authors:
Marc-Andre Hermanns;Markus Geimer;Felix Wolf;Brian J. N. Wylie
Venue:
PDP '09 Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Year:
2009


Abstract

In message-passing applications, the temporal or spatial distance between cause and symptom of a performance problem constitutes a major difficulty in deriving helpful conclusions from performance data. Just knowing the locations of wait states in the program is often insufficient to understand the reason for their occurrence. We present a method for verifying hypotheses on causality between temporally or spatially distant performance phenomena in message-passing applications without altering the application itself. The verification is accomplished by modifying MPI event traces and using them to simulate the hypothetical message-passing behavior. By performing a parallel real-time reenactment of the communication to be simulated using the original execution configuration, we can achieve high scalability and good predictive accuracy in relation to the measured behavior. Not relying on a potentially complex model of the message-passing subsystem, our method is also platform independent.
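The idea of editing a trace and replaying it to test a causal hypothesis can be illustrated with a toy example. The sketch below is not the authors' tool: the abstract describes a parallel real-time reenactment on the original execution configuration, whereas this is a small sequential replay with an assumed eager-send model and an assumed fixed latency. All names (`Event`, `replay`, `rebalance`) and the three-rank trace are hypothetical, chosen only to show how removing a suspected cause on one rank can make a wait state on a distant rank disappear.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    """One record of a simplified, hypothetical event trace."""
    rank: int
    kind: str               # "compute", "send", or "recv"
    duration: float = 0.0   # seconds; used only by "compute" events
    peer: int = -1          # partner rank for "send" and "recv" events


def replay(trace, latency=0.0):
    """Replay the trace and return the accumulated receive wait time per rank.

    Assumptions of this sketch: sends never block (eager protocol), messages
    between a pair of ranks arrive in FIFO order, and latency is constant.
    """
    ranks = sorted({e.rank for e in trace})
    queues = {r: [e for e in trace if e.rank == r] for r in ranks}
    pos = {r: 0 for r in ranks}
    clock = {r: 0.0 for r in ranks}
    waits = {r: 0.0 for r in ranks}
    in_flight = {}  # (source, destination) -> list of arrival times

    progressed = True
    while progressed:
        progressed = False
        for r in ranks:
            while pos[r] < len(queues[r]):
                e = queues[r][pos[r]]
                if e.kind == "compute":
                    clock[r] += e.duration
                elif e.kind == "send":
                    in_flight.setdefault((r, e.peer), []).append(clock[r] + latency)
                elif e.kind == "recv":
                    pending = in_flight.get((e.peer, r), [])
                    if not pending:
                        break  # matching send not replayed yet; try other ranks
                    arrival = pending.pop(0)
                    if arrival > clock[r]:
                        # Receiver was ready before the message arrived: wait state.
                        waits[r] += arrival - clock[r]
                        clock[r] = arrival
                pos[r] += 1
                progressed = True
    return waits


def rebalance(trace, rank, new_duration):
    """Return a modified trace in which every compute event of `rank` takes `new_duration`."""
    return [
        replace(e, duration=new_duration) if e.rank == rank and e.kind == "compute" else e
        for e in trace
    ]


# Measured behavior (made-up numbers): rank 0 is overloaded, rank 1 forwards to rank 2.
measured = [
    Event(0, "compute", duration=5.0), Event(0, "send", peer=1),
    Event(1, "compute", duration=1.0), Event(1, "recv", peer=0),
    Event(1, "compute", duration=1.0), Event(1, "send", peer=2),
    Event(2, "compute", duration=2.0), Event(2, "recv", peer=1),
]

# Hypothesis: the wait on rank 2 is caused by the long computation on rank 0.
hypothetical = rebalance(measured, rank=0, new_duration=1.0)

print(replay(measured))      # {0: 0.0, 1: 4.0, 2: 4.0}
print(replay(hypothetical))  # {0: 0.0, 1: 0.0, 2: 0.0}
```

In this toy trace, rank 2 never communicates with rank 0, yet its wait time vanishes once rank 0's computation is shortened in the modified trace, which is the kind of spatially distant cause and symptom the abstract refers to. A model-based replay like this one depends on the assumed latency and send semantics; the method described in the abstract avoids such a model by reenacting the communication on the real system.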