The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue execution even if MPI processes fail while it is running. The discussion covers the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.