Replication for send-deterministic MPI HPC applications

Authors:
Arnaud Lefray;Thomas Ropars;André Schiper
Affiliations:
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Venue:
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Year:
2013

Abstract

Replication has recently gained attention as a fault-tolerance technique for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent of the underlying library. In this paper, we evaluate the advantages of a different approach. First, we take advantage of send-determinism, a communication property common to many MPI HPC applications. Second, we implement replication inside the MPI library. The main advantage of our approach is simplicity. Although it is only a small patch to the Open MPI library, our solution, called SDR-MPI, supports most of the main features of the MPI standard, including all collectives and group operations. SDR-MPI also achieves good performance: experiments run with HPC benchmarks and applications show that its overhead remains below 5%.
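
The abstract does not define send-determinism. As the term is commonly used in the literature, a process is send-deterministic if, for a given input, it sends the same sequence of messages in every correct execution, whatever the order in which it receives messages. The sketch below illustrates that definition with a toy model in plain Python; it is not SDR-MPI code and does not use MPI. The function names, the message tuples, and the destination label are all hypothetical.

```python
from itertools import permutations

def reduce_then_send(inbox):
    """Send-deterministic toy process: what it sends depends only on the
    set of received values, not on their arrival order."""
    total = sum(payload for _source, payload in inbox)
    return [("dest=0", total)]

def forward_first(inbox):
    """Toy process that is not send-deterministic: what it sends depends
    on which message happened to arrive first."""
    _source, payload = inbox[0]
    return [("dest=0", payload)]

def is_send_deterministic(process, messages):
    """Return True if every reception order yields the same send sequence."""
    send_sequences = {
        tuple(process(list(order))) for order in permutations(messages)
    }
    return len(send_sequences) == 1

# (source rank, payload) pairs; the values are made up for illustration.
messages = [(1, 10), (2, 20), (3, 30)]

print(is_send_deterministic(reduce_then_send, messages))  # True
print(is_send_deterministic(forward_first, messages))     # False
```

Under this definition, two replicas of the first process would emit identical messages even if they received their inputs in different orders, whereas replicas of the second could diverge.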