Large scale debugging of parallel tasks with AutomaDeD

Authors:
Ignacio Laguna;Todd Gamblin;Bronis R. de Supinski;Saurabh Bagchi;Greg Bronevetsky;Dong H. Anh;Martin Schulz;Barry Rountree
Affiliations:
Purdue University, West Lafayette, IN;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA;Purdue University, West Lafayette, IN;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA;Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 11
Cited 3

BoomerAMG: a parallel algebraic multigrid solver and preconditioner

Applied Numerical Mathematics - Developments and trends in iterative methods for large systems of equations—in memoriam Rüdiger Weiss
X-means: Extending K-means with Efficient Estimation of the Number of Clusters

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Pattern Recognition and Machine Learning (Information Science and Statistics)

Pattern Recognition and Machine Learning (Information Science and Statistics)
Problem diagnosis in large-scale computing environments

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
PNMPI tools: a whole lot greater than the sum of their parts

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Lessons learned at 208K: towards debugging millions of cores

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Scalable temporal order analysis for large scale debugging

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Clustering performance data efficiently at massive scales

Proceedings of the 24th ACM International Conference on Supercomputing
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

A Scalable Parallel Debugging Library with Pluggable Communication Protocols

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Probabilistic diagnosis of performance faults in large-scale parallel applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A scalable, non-parametric anomaly detection framework for Hadoop

Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference

Quantified Score

Hi-index	0.00

Visualization

Abstract

Developing correct HPC applications continues to be a challenge as the number of cores increases in today's largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. The overhead of collecting large amounts of runtime information and an absence of scalable error detection algorithms generally cause poor scalability. In this work, we present novel, highly efficient techniques that facilitate the process of debugging large scale parallel applications. Our approach extends our previous work, AutomaDeD, in three major areas to isolate anomalous tasks in a scalable manner: (i) we efficiently compare elements of graph models (used in AutomaDeD to model parallel tasks) using pre-computed lookup-tables and by pointer comparison; (ii) we compress per-task graph models before the error detection analysis so that comparison between models involves many fewer elements; (iii) we use scalable sampling-based clustering and nearest-neighbor techniques to isolate abnormal tasks when bugs and performance anomalies are manifested. Our evaluation with fault injections shows that AutomaDeD scales well to thousands of tasks and that it can find anomalous tasks in under 5 seconds in an online manner.