As computing platforms grow to extreme scale, the requirements for application fault tolerance grow with them. Techniques that address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support they are bound to fail. This paper discusses the failure-free overhead and the recovery impact of the User-Level Failure Mitigation (ULFM) proposal presented to the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on failure-free performance for a range of applications and yields satisfactory recovery times when failures occur.