A long-term trend in high-performance computing is the growing number of nodes in parallel computing platforms, which entails a higher probability of failure. Fault-tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault-tolerant MPI has led to the development of several fault-tolerant MPI environments. These environments use a variety of fault-tolerant message-passing protocols based on either coordinated checkpointing or message logging, of which coordinated checkpointing is the most popular. Two forms of coordinated checkpointing have been proposed in the literature: blocking and nonblocking. However, they have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented both approaches within the MPICH environment and evaluated their performance using the NAS Parallel Benchmarks.
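The link between node count and failure probability can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes that nodes fail independently and share the same per-node failure probability over some fixed time window, and the function name and the sample value of 1e-4 are hypothetical, chosen only for illustration.

```python
def system_failure_probability(node_failure_prob: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes nodes fails.

    Assumption (not from the paper): failures are independent and every
    node has the same failure probability over the time window considered.
    """
    if not 0.0 <= node_failure_prob <= 1.0:
        raise ValueError("node_failure_prob must be in [0, 1]")
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    # All nodes survive with probability (1 - p)^n; the complement is
    # the probability that at least one node fails.
    return 1.0 - (1.0 - node_failure_prob) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability, for illustration only.
    p = 1e-4
    for n in (1, 100, 10_000):
        print(f"{n:>6} nodes: {system_failure_probability(p, n):.4f}")
```

Under these assumptions the probability of at least one failure rises steadily with the node count, which is the reason checkpointing overhead and scalability matter more as platforms grow.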