Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI

Authors:
Aurelien Bouteiller;Boris Collin;Thomas Herault;Pierre Lemarinier;Franck Cappello
Affiliations:
INRIA/LRI, Université Paris-Sud, Orsay, France;INRIA/LRI, Université Paris-Sud, Orsay, France;INRIA/LRI, Université Paris-Sud, Orsay, France;INRIA/LRI, Université Paris-Sud, Orsay, France;INRIA/LRI, Université Paris-Sud, Orsay, France
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Year:
2005

Citing 11
Cited 3

Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
A high-performance, portable implementation of the MPI message passing interface standard

Parallel Computing
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World

Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
The Relative Overhead of Piggybacking in Causal Message Logging Protocols

SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
The Cost of Recovery in Message Logging Protocols

SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
An Efficient Algorithm for Causal Message Logging

SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
Message logging: pessimistic, optimistic, and causal

ICDCS '95 Proceedings of the 15th International Conference on Distributed Computing Systems
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing

Active Optimistic Message Logging for Reliable Execution of MPI Applications

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Improving message logging protocols scalability through distributed event logging

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault detection and handling. For this last approach, several protocols have been proposed in the literature. In a recent paper, we have demonstrated that uncoordinated checkpointing tolerates higher fault frequency than coordinated checkpointing.Moreover causal message logging protocols have been proved the most efficient message logging technique. These protocols consist in piggybacking non deterministic events to computation message. Several protocols have been proposed in the literature. Their merits are usually evaluated from four metrics: a) piggybacking computation cost, b) piggyback size, c) applications performance and d) fault recovery performance. In this paper, we investigate the benefit of using a stable storage for logging message events in causal message logging protocols. To evaluate the advantage of this technique we implemented three protocols: 1) a classical causal message protocol proposed in Manetho, 2) a state of the art protocol known as LogOn, 3) a light computation cost protocol called Vcausal. We demonstrate a major impact of this stable storage for the three protocols, on the four criteria for micro benchmarks as well as for the NAS benchmark.