Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
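
The idea of inferring the least progressed task from a Markov model of execution history can be illustrated with a small sketch. This is not the tool's implementation: the state labels, the function names, and the use of a max-product path probability as the progress relation are all assumptions made for illustration, and the sketch omits the dependence analysis and the slicing step entirely. Each task contributes a trace of states; transitions observed across all traces define the Markov model; a task whose current state is more likely to lead to another task's current state than the reverse is treated as having made less progress.

```python
from collections import defaultdict


def transition_probabilities(traces):
    """Estimate Markov transition probabilities from per-task state traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, outs in counts.items():
        total = sum(outs.values())
        probs[src] = {dst: n / total for dst, n in outs.items()}
    return probs


def reachability(probs, states):
    """Probability of the most likely path between every pair of states.

    Floyd-Warshall over products of transition probabilities. This is an
    illustrative stand-in for a progress relation, not the tool's method.
    """
    reach = {a: {b: probs.get(a, {}).get(b, 0.0) for b in states}
             for a in states}
    for k in states:
        for a in states:
            for b in states:
                via_k = reach[a][k] * reach[k][b]
                if via_k > reach[a][b]:
                    reach[a][b] = via_k
    return reach


def least_progressed_tasks(traces):
    """Return ranks whose current state precedes every other current state."""
    states = sorted({state for trace in traces for state in trace})
    reach = reachability(transition_probabilities(traces), states)
    current = {rank: trace[-1] for rank, trace in enumerate(traces)}
    active = set(current.values())
    # A state is a candidate if no other current state is more likely to
    # lead into it than it is to lead into that other state.
    candidates = {
        s for s in active
        if all(reach[s][t] >= reach[t][s] for t in active if t != s)
    }
    return sorted(rank for rank, state in current.items()
                  if state in candidates)


if __name__ == "__main__":
    # Hypothetical traces: task 2 stalled in "compute" while the others
    # reached "barrier".
    example = [
        ["init", "compute", "send", "barrier"],
        ["init", "compute", "send", "barrier"],
        ["init", "compute"],
        ["init", "compute", "send", "barrier"],
    ]
    print(least_progressed_tasks(example))  # prints [2]
```

In this toy example every other task is waiting at the barrier for task 2, so task 2's current state is where a slicing step would start looking for the cause. When two current states are unrelated in the model, the sketch treats both as candidates rather than guessing an order.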