Probabilistic diagnosis of performance faults in large-scale parallel applications

  • Authors:
  • Ignacio Laguna;Dong H. Ahn;Bronis R. de Supinski;Saurabh Bagchi;Todd Gamblin

  • Affiliations:
  • Purdue University, West Lafayette, IN, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;Purdue University, West Lafayette, IN, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA

  • Venue:
  • Proceedings of the 21st international conference on Parallel architectures and compilation techniques
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.