Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

Authors:
P. Lemarinier;A. Bouteiller;T. Herault;G. Krawezik;F. Cappello
Affiliations:
LRI, Univ. de Paris Sud, Orsay, France;LRI, Univ. de Paris Sud, Orsay, France;LRI, Univ. de Paris Sud, Orsay, France;LRI, Univ. de Paris Sud, Orsay, France;LRI, Univ. de Paris Sud, Orsay, France
Venue:
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Year:
2004

Cited 13

Hybrid Preemptive Scheduling of MPI Applications on the Grids
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing

Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01

Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Towards highly available and scalable high performance clusters
Journal of Computer and System Sciences

Interconnect agnostic checkpoint/restart in open MPI
Proceedings of the 18th ACM international symposium on High performance distributed computing

Active Optimistic Message Logging for Reliable Execution of MPI Applications
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing

Team-Based Message Logging: Preliminary Results
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

Improving message logging protocols scalability through distributed event logging
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Dodging the cost of unavoidable memory copies in message logging protocols
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface

Towards building a highly-available cluster based model for high performance computing
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Correlated set coordination in fault tolerant message logging protocols
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II

Alleviating scalability issues of checkpointing protocols
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

The viability of using compression to decrease message log sizes
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops


Abstract

Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems, with different impacts on application performance and different capacities to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging because of the higher stress it places on the checkpoint server. We extend this study to improved versions of the message logging and coordinated checkpointing protocols, which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpointing. We detail the protocols and their implementation in the new MPICH-V fault-tolerant framework. We compare their performance against the previous versions, and we compare the novel message logging protocols against the improved coordinated checkpointing one using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of this work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.
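
To make the latency argument in the abstract concrete, the following is a minimal toy sketch of pessimistic sender-based message logging, not the MPICH-V implementation. All names here (`EventLogger`, `Process`, `replay`) are hypothetical, and the blocking step is simplified to a synchronous call made before each delivery. Payloads stay in the sender's memory, while each reception event is recorded on a separate logger whose acknowledgement stands in for the blocking control message.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stand-in for a reliable store of reception events."""
    events: dict = field(default_factory=dict)  # receiver name -> [(sender name, seq)]
    round_trips: int = 0                        # counts blocking control exchanges

    def log(self, receiver, sender, seq):
        self.events.setdefault(receiver, []).append((sender, seq))
        self.round_trips += 1
        return True  # acknowledgement; the caller waits for it


@dataclass
class Process:
    name: str
    logger: EventLogger
    next_seq: int = 0
    sent_log: dict = field(default_factory=dict)   # (dest name, seq) -> payload
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def send(self, dest, payload):
        seq = self.next_seq
        self.next_seq += 1
        # Sender-based logging: the payload is kept in the sender's memory.
        self.sent_log[(dest.name, seq)] = payload
        dest.receive(self, seq, payload)

    def receive(self, sender, seq, payload):
        # Pessimistic step (simplified): the reception event must be
        # acknowledged before the process continues. This extra exchange
        # is the latency overhead discussed in the abstract.
        acked = self.logger.log(self.name, sender.name, seq)
        if not acked:
            raise RuntimeError("reception event was not acknowledged")
        self.delivered.append(payload)


def replay(failed_name, logger, survivors):
    """Rebuild the delivery sequence of a failed process from the logs."""
    by_name = {p.name: p for p in survivors}
    return [
        by_name[sender].sent_log[(failed_name, seq)]
        for sender, seq in logger.events.get(failed_name, [])
    ]
```

In this sketch every reception costs one extra round trip to the logger, but a failed process can be replayed alone from the surviving senders' logs, without rolling back the other processes. Coordinated checkpointing makes the opposite trade: no per-message control traffic, but every process restarts from the last global checkpoint after a fault.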