Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Authors:
Camille Coti;Thomas Herault;Pierre Lemarinier;Laurence Pilard;Ala Rezmerita;Eric Rodriguez;Franck Cappello
Affiliations:
Université Paris-XI, France;Université Paris-XI, France;Université Paris-XI, France;Université Paris-XI, France;Université Paris-XI, France;Université Paris-XI, France;Université Paris-XI, France
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Citing (11):
Optimistic recovery in distributed systems. ACM Transactions on Computer Systems (TOCS)
Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS)
A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing
Communication-Induced Determination of Consistent Snapshots. IEEE Transactions on Parallel and Distributed Systems
MPI: The Complete Reference
A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR)
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Automated application-level checkpointing of MPI programs. Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
An Analysis of Communication-Induced Checkpointing. FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs. Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing

Cited by (9):
Dynasa: adapting grid applications to safety using fault-tolerant methods. HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Interconnect agnostic checkpoint/restart in open MPI. Proceedings of the 18th ACM international symposium on High performance distributed computing
Characterizing fault tolerance in genetic programming. BADS '09 Proceedings of the 2009 workshop on Bio-inspired algorithms for distributed systems
The Architecture of the XtreemOS Grid Checkpointing Service. Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
An adaptive and safe ubicomp for HPC applications. International Journal of Ad Hoc and Ubiquitous Computing
Characterizing fault tolerance in genetic programming. Future Generation Computer Systems
A scalable asynchronous replication-based strategy for fault tolerant MPI applications. HiPC'07 Proceedings of the 14th international conference on High performance computing
Modeling and tolerating heterogeneous failures in large parallel systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds. Journal of Parallel and Distributed Computing

Abstract

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. Coordinated checkpointing is the most popular approach. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.