Application-specific fault tolerance via data access characterization

  • Authors:
  • Nawab Ali; Sriram Krishnamoorthy; Niranjan Govind; Karol Kowalski; Ponnuswamy Sadayappan

  • Affiliations:
  • Pacific Northwest National Laboratory, Richland, WA; Pacific Northwest National Laboratory, Richland, WA; Pacific Northwest National Laboratory, Richland, WA; Pacific Northwest National Laboratory, Richland, WA; The Ohio State University, Columbus, OH

  • Venue:
  • Euro-Par'11: Proceedings of the 17th International Conference on Parallel Processing - Volume Part II
  • Year:
  • 2011

Abstract

Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.