Run-through stabilization: an MPI proposal for process fault tolerance

  • Authors: Joshua Hursey; Richard L. Graham; Greg Bronevetsky; Darius Buntinas; Howard Pritchard; David G. Solt
  • Affiliations: Oak Ridge National Laboratory; Oak Ridge National Laboratory; Lawrence Livermore National Laboratory; Argonne National Laboratory; Cray, Inc.; Hewlett-Packard
  • Venue: EuroMPI'11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface
  • Year: 2011

Abstract

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue executing even if MPI processes fail during the run. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.