Redesigning the message logging model for high performance

Authors:
Aurelien Bouteiller;George Bosilca;Jack Dongarra
Affiliations:
ICL, University of Tennessee Knoxville, Claxton 1122 Volunteer Boulevard, Knoxville, TN 37996, U.S.A.;ICL, University of Tennessee Knoxville, Claxton 1122 Volunteer Boulevard, Knoxville, TN 37996, U.S.A.;ICL, University of Tennessee Knoxville, Claxton 1122 Volunteer Boulevard, Knoxville, TN 37996, U.S.A.
Venue:
Concurrency and Computation: Practice & Experience - International Supercomputing Conference
Year:
2010

Abstract

Over the past decade, the number of processors used in high-performance computing has grown to hundreds of thousands. As a direct consequence, while computational power has followed the trend, the mean time between failures (MTBF) has dropped and is now counted in hours. To circumvent this limitation, a number of fault-tolerant algorithms and execution environments have been developed for the message-passing paradigm. Among them, message logging has been shown to achieve better overall performance when the MTBF is low, mainly because of its faster failure recovery. However, message logging suffers from high overhead when no failure occurs. In this paper, we therefore discuss a refinement of the message logging model intended to improve failure-free performance. The proposed approach simultaneously removes unnecessary memory copies and reduces the number of logged events. We present the implementation of a pessimistic message logging protocol in Open MPI and compare it with the previous reference implementation, MPICH-V2. The results show a performance improvement of several orders of magnitude and zero overhead for most messages. This article is a U.S. Government work and is in the public domain in the U.S.A.
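
As a rough illustration of what "reducing the number of logged events" can mean in a pessimistic protocol, the toy Python sketch below logs a reception event only when its outcome depends on arrival order. The abstract does not say which events the refined model drops, so the rule used here (only wildcard receives are nondeterministic) is an assumption, and `ANY_SOURCE`, `run`, and the list-based message queue are hypothetical names invented for this example; nothing here is Open MPI or MPICH-V2 code.

```python
ANY_SOURCE = -1  # hypothetical wildcard marker, in the spirit of MPI_ANY_SOURCE


def run(arrivals, recv_sources, event_log=None):
    """Deliver messages to one receiver and return (delivered, log).

    arrivals:     list of (sender, payload) pairs in arrival order.
    recv_sources: source argument of each receive the application posts.
    event_log:    None for a normal run; a previously saved log for replay.
    """
    pending = list(arrivals)
    replay = iter(event_log) if event_log is not None else None
    log, delivered = [], []
    for source in recv_sources:
        if source == ANY_SOURCE:
            if replay is not None:
                source = next(replay)    # replay: force the logged outcome
            else:
                source = pending[0][0]   # normal run: first arrival wins
            # Pessimistic (assumed meaning): the event is recorded before
            # the delivery is allowed to take effect.
            log.append(source)
        # Named receives are deterministic here, so nothing is logged.
        idx = next(i for i, (s, _) in enumerate(pending) if s == source)
        delivered.append(pending.pop(idx))
    return delivered, log


# Normal execution: three receives, two of them wildcards.
first_arrivals = [(2, "b0"), (1, "a0"), (1, "a1")]
posted = [ANY_SOURCE, 1, ANY_SOURCE]
delivered, log = run(first_arrivals, posted)

# Recovery: messages arrive in a different order, the log restores the order.
second_arrivals = [(1, "a0"), (1, "a1"), (2, "b0")]
replayed, _ = run(second_arrivals, posted, event_log=log)
unlogged, _ = run(second_arrivals, posted)
```

In this sketch only two of the three receptions produce a log entry, and replaying with the saved log reproduces the original delivery order even though the arrival order changed; without the log, the re-execution diverges.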