Probabilistic diagnosis of performance faults in large-scale parallel applications

Authors:
Ignacio Laguna; Dong H. Ahn; Bronis R. de Supinski; Saurabh Bagchi; Todd Gamblin
Affiliations:
Purdue University, West Lafayette, IN, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA; Purdue University, West Lafayette, IN, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA
Venue:
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Year:
2012

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
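
To make the idea of inferring the least progressed task more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each task records a trace of named states (the state names, the `traces` layout, and the function names are all hypothetical), estimates a first-order Markov model from those traces, and then replaces the paper's probabilistic inference with a much simpler reachability check: a task counts as least progressed if no other task's current state lies strictly before its own in the transition graph.

```python
from collections import defaultdict


def build_markov_model(traces):
    """Estimate first-order transition probabilities from per-task state traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces.values():
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reachable(model, start, min_prob=0.0):
    """States reachable from `start` over edges with probability > min_prob."""
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        for nxt, prob in model.get(state, {}).items():
            if prob > min_prob and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def least_progressed(traces, min_prob=0.0):
    """Return tasks whose current state has no strict predecessor among
    the current states of the other tasks (a simplifying assumption)."""
    model = build_markov_model(traces)
    current = {task: trace[-1] for task, trace in traces.items()}
    reach = {s: reachable(model, s, min_prob) for s in set(current.values())}
    result = []
    for task, state in current.items():
        # `other` is strictly before `state` if it can reach `state`
        # but `state` cannot reach it back (i.e. they are not in a cycle).
        has_predecessor = any(
            state in reach[other] and other not in reach[state]
            for other in reach
            if other != state
        )
        if not has_predecessor:
            result.append(task)
    return sorted(result)


# Hypothetical traces: task 2 stalls in "compute" while the others reach "barrier".
traces = {
    0: ["init", "compute", "send", "barrier"],
    1: ["init", "compute", "send", "barrier"],
    2: ["init", "compute"],
}
result = least_progressed(traces)
print(result)
```

In this toy input, tasks 0 and 1 wait at the barrier while task 2 never leaves `compute`, so task 2 is the one a developer would inspect first. States inside a loop can reach each other and are therefore treated as equally progressed by this sketch; raising the hypothetical `min_prob` threshold prunes rarely taken transitions before the reachability check.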