Run-through stabilization: an MPI proposal for process fault tolerance

Authors:
Joshua Hursey;Richard L. Graham;Greg Bronevetsky;Darius Buntinas;Howard Pritchard;David G. Solt
Affiliations:
Oak Ridge National Laboratory;Oak Ridge National Laboratory;Lawrence Livermore National Laboratory;Argonne National Laboratory;Cray, Inc.;Hewlett-Packard
Venue:
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Year:
2011



Abstract

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue running even if MPI processes fail during execution. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.