Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
MPI: a message passing interface
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Communication-Induced Determination of Consistent Snapshots
IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
The Cost of Recovery in Message Logging Protocols
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Team-Based Message Logging: Preliminary Results
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Dodging the cost of unavoidable memory copies in message logging protocols
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
Future Generation Computer Systems
Alleviating scalability issues of checkpointing protocols
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Post-failure recovery of MPI communication capability: Design and rationale
International Journal of High Performance Computing Applications
Multi-criteria checkpointing strategies: response-time versus resource utilization
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Hi-index | 0.00 |
Based on our current expectation for the exascale systems, composed of hundred of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.