Parallel program debugging with on-the-fly anomaly detection
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Event graph visualization for debugging large applications
SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Relative debugging: a new methodology for debugging scientific applications
Communications of the ACM
POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Monotonic evolution: an alternative to induction variable substitution for dependence analysis
ICS '01 Proceedings of the 15th international conference on Supercomputing
Tracking down software bugs using automatic anomaly detection
Proceedings of the 24th International Conference on Software Engineering
BoomerAMG: a parallel algebraic multigrid solver and preconditioner
Applied Numerical Mathematics - Developments and trends in iterative methods for large systems of equations—in memoriam Rüdiger Weiss
Relative Debugging for Data-Parallel Programs: A ZPL Case Study
IEEE Concurrency
IEEE Transactions on Software Engineering
ROSE: An Optimizing Transformation System for C++ Array-Class Libraries
ECOOP '98 Workshop ion on Object-Oriented Technology
MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Barrier matching for programs with textually unaligned barriers
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
PNMPI tools: a whole lot greater than the sum of their parts
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Lessons learned at 208K: towards debugging millions of cores
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
HOLMES: Effective statistical debugging via efficient path profiling
ICSE '09 Proceedings of the 31st International Conference on Software Engineering
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
GRace: a low-overhead mechanism for detecting data races in GPU programs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Vrisha: using scaling properties of parallel programs for bug detection and localization
Proceedings of the 20th international symposium on High performance distributed computing
Formal analysis of MPI-based parallel programs
Communications of the ACM
Large scale debugging of parallel tasks with AutomaDeD
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Probabilistic diagnosis of performance faults in large-scale parallel applications
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
Hi-index | 0.02 |
We present a scalable temporal order analysis technique that supports debugging of large scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis techniques to guide selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine then leverages this information along with the application's static control structure to apply data flow analysis techniques to determine key application data such as loop control variables. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks. Our evaluation, which extends the Stack Trace Analysis Tool (STAT), demonstrates that this temporal order analysis technique can isolate bugs in benchmark codes with injected faults as well as a real world hang case with AMG2006.