Scalable temporal order analysis for large scale debugging

Authors:
Dong H. Ahn;Bronis R. de Supinski;Ignacio Laguna;Gregory L. Lee;Ben Liblit;Barton P. Miller;Martin Schulz
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;Purdue University, West Lafayette, IN;Lawrence Livermore National Laboratory, Livermore, CA;University of Wisconsin, Madison, WI;University of Wisconsin, Madison, WI;Lawrence Livermore National Laboratory, Livermore, CA
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Abstract

We present a scalable temporal order analysis technique that supports debugging of large-scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static and dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis to guide the selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine then combines this information with the application's static control structure and applies data flow analysis to identify key application data such as loop control variables. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks. Our evaluation, based on an implementation that extends the Stack Trace Analysis Tool (STAT), demonstrates that this temporal order analysis technique can isolate bugs in benchmark codes with injected faults as well as in a real-world hang case with AMG2006.
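
To make the idea of classifying tasks by logical execution order concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each MPI task has already reported the values of its loop control variables at a common program point, outermost loop first, and that those variables increase monotonically. The function name `temporal_order` and the input format are hypothetical, chosen only for illustration. Under those assumptions, comparing the tuples lexicographically orders the tasks by progress, and tasks with identical tuples fall into the same class.

```python
from collections import defaultdict


def temporal_order(progress):
    """Group task ranks by progress and order the groups from least to most advanced.

    `progress` maps an MPI rank to a tuple of loop control variable values,
    outermost loop first (a hypothetical input format for this sketch).
    """
    groups = defaultdict(list)
    for rank, key in progress.items():
        groups[tuple(key)].append(rank)
    # Tuples compare lexicographically, so a difference in an outer loop
    # dominates any difference in an inner loop.
    return [(key, sorted(ranks)) for key, ranks in sorted(groups.items())]


# Illustrative data only: rank -> (outer iteration, inner iteration).
sample = {0: (5, 2), 1: (5, 2), 2: (4, 9), 3: (5, 0)}
ordered = temporal_order(sample)
for key, ranks in ordered:
    print(key, ranks)
```

In this sketch the first class in the result holds the tasks that have made the least progress, which is a natural place to start looking when diagnosing a hang.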