Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
The Relative Overhead of Piggybacking in Causal Message Logging Protocols
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
The Cost of Recovery in Message Logging Protocols
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
An Efficient Algorithm for Causal Message Logging
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
Message logging: pessimistic, optimistic, and causal
ICDCS '95 Proceedings of the 15th International Conference on Distributed Computing Systems
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Active Optimistic Message Logging for Reliable Execution of MPI Applications
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Improving message logging protocols scalability through distributed event logging
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault detection and handling. For this last approach, several protocols have been proposed in the literature. In a recent paper, we have demonstrated that uncoordinated checkpointing tolerates higher fault frequency than coordinated checkpointing.Moreover causal message logging protocols have been proved the most efficient message logging technique. These protocols consist in piggybacking non deterministic events to computation message. Several protocols have been proposed in the literature. Their merits are usually evaluated from four metrics: a) piggybacking computation cost, b) piggyback size, c) applications performance and d) fault recovery performance. In this paper, we investigate the benefit of using a stable storage for logging message events in causal message logging protocols. To evaluate the advantage of this technique we implemented three protocols: 1) a classical causal message protocol proposed in Manetho, 2) a state of the art protocol known as LogOn, 3) a light computation cost protocol called Vcausal. We demonstrate a major impact of this stable storage for the three protocols, on the four criteria for micro benchmarks as well as for the NAS benchmark.