Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Unreliable failure detectors for reliable distributed systems
Journal of the ACM (JACM)
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Performance Evaluation of the Quadrics Interconnection Network
Cluster Computing
Message logging: pessimistic, optimistic, and causal
ICDCS '95 Proceedings of the 15th International Conference on Distributed Computing Systems
Fault Tolerance in Message Passing Interface Programs
International Journal of High Performance Computing Applications
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
International Journal of High Performance Computing Applications
Redesigning the message logging model for high performance
Concurrency and Computation: Practice & Experience - International Supercomputing Conference
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
Correlated set coordination in fault tolerant message logging protocols
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proactive fault tolerance in MPI applications via task migration
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Hi-index | 0.00 |
As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivation behind the design choices are weighted against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.