Large scale debugging of parallel tasks with AutomaDeD

  • Authors:
  • Ignacio Laguna;Todd Gamblin;Bronis R. de Supinski;Saurabh Bagchi;Greg Bronevetsky;Dong H. Ahn;Martin Schulz;Barry Rountree

  • Affiliations:
  • Purdue University, West Lafayette, IN (Laguna, Bagchi); Lawrence Livermore National Laboratory, Computation Directorate, Livermore, CA (Gamblin, de Supinski, Bronevetsky, Ahn, Schulz, Rountree)

  • Venue:
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2011

Abstract

Developing correct HPC applications continues to be a challenge as the number of cores increases in today's largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. Poor scalability generally stems from the overhead of collecting large amounts of runtime information and from the absence of scalable error detection algorithms. In this work, we present novel, highly efficient techniques that facilitate debugging of large-scale parallel applications. Our approach extends our previous work, AutomaDeD, in three major areas to isolate anomalous tasks in a scalable manner: (i) we efficiently compare elements of graph models (used in AutomaDeD to model parallel tasks) using pre-computed lookup tables and pointer comparison; (ii) we compress per-task graph models before the error detection analysis so that comparisons between models involve far fewer elements; (iii) we use scalable sampling-based clustering and nearest-neighbor techniques to isolate abnormal tasks when bugs and performance anomalies manifest. Our evaluation with fault injections shows that AutomaDeD scales well to thousands of tasks and can find anomalous tasks online in under 5 seconds.
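
The sampling-based clustering and nearest-neighbor step can be pictured with a small sketch. The code below is an illustrative approximation only, not AutomaDeD's implementation: it assumes each task's (compressed) model has already been reduced to a numeric feature vector, clusters a random sample of tasks with plain k-means, and then flags tasks whose distance to the nearest centroid is a statistical outlier. All function names, parameters, and thresholds here are hypothetical.

```python
# Illustrative sketch of sampling-based clustering + nearest-neighbor
# outlier isolation over per-task model features. Names, parameters,
# and thresholds are hypothetical, not AutomaDeD's actual code.
import numpy as np


def sample_and_cluster(features, sample_size=64, k=3, iters=20, rng=None):
    """Cluster a random sample of per-task feature vectors with plain k-means."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(features), size=min(sample_size, len(features)), replace=False)
    sample = features[idx]
    centroids = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each sampled task to its nearest centroid.
        dists = np.linalg.norm(sample[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster went empty.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = sample[labels == c].mean(axis=0)
    return centroids


def flag_anomalous_tasks(features, centroids, z=3.0):
    """Flag tasks whose nearest-centroid distance is a statistical outlier."""
    nearest = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
    cutoff = nearest.mean() + z * nearest.std()
    return np.where(nearest > cutoff)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for per-task model features: most tasks behave
    # alike, while a handful (e.g., tasks hit by an injected fault) drift.
    normal = rng.normal(0.0, 0.1, size=(1000, 8))
    faulty = rng.normal(3.0, 0.1, size=(4, 8))
    features = np.vstack([normal, faulty])
    centroids = sample_and_cluster(features, rng=rng)
    print("suspect tasks:", flag_anomalous_tasks(features, centroids))
```

Clustering only a fixed-size sample keeps the per-task analysis cost roughly independent of the total task count, which mirrors the scalability argument the abstract makes for the sampling-based approach.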