Correlated set coordination in fault tolerant message logging protocols

Authors:
Aurelien Bouteiller;Thomas Herault;George Bosilca;Jack J. Dongarra
Affiliations:
Innovative Computing Laboratory, The University of Tennessee;Innovative Computing Laboratory, The University of Tennessee;Innovative Computing Laboratory, The University of Tennessee;Innovative Computing Laboratory, The University of Tennessee and Oak Ridge National Laboratory
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Year:
2011


Abstract

Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint-restart techniques, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
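
The split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a correlated set is simply the set of ranks placed on the same node, and that every message is recorded as an event while only messages crossing a set boundary have their payload saved. The names `SenderLog`, `node_of`, `same_set`, and `send` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SenderLog:
    """Toy log: events for every message, payloads only across sets."""

    # Hypothetical placement map: rank -> node identifier.
    node_of: dict
    # One (src, dst, seq) entry per message, regardless of placement.
    events: list = field(default_factory=list)
    # Payload copies, kept only for messages between different nodes.
    payloads: dict = field(default_factory=dict)

    def same_set(self, src, dst):
        # Assumption: ranks on the same node form one correlated set.
        return self.node_of[src] == self.node_of[dst]

    def send(self, src, dst, seq, payload):
        # The event is always recorded (event logging is retained).
        self.events.append((src, dst, seq))
        # The payload copy is skipped inside a correlated set, because
        # those processes are assumed to recover together.
        if not self.same_set(src, dst):
            self.payloads[(src, dst, seq)] = payload


log = SenderLog(node_of={0: "n0", 1: "n0", 2: "n1"})
log.send(0, 1, 0, b"intra-node")  # event only
log.send(0, 2, 0, b"inter-node")  # event and payload
print(len(log.events), sorted(log.payloads))
```

In this toy setup, ranks 0 and 1 share a node, so the message between them is recorded as an event without a payload copy, whereas the message from rank 0 to rank 2 is logged in full.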