Event Logging: Portable and Efficient Checkpointing in Heterogeneous Environments with Non-FIFO Communication Platforms

Authors:
Zhao Peng; Alexey Lastovetsky
Affiliations:
University College Dublin, Belfield, Ireland; University College Dublin, Belfield, Ireland
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
Year:
2005

Abstract

The Chandy-Lamport checkpointing algorithm is widely used in fault-tolerant implementations of MPI. However, it assumes that message passing is FIFO, a property the MPI standard does not guarantee at the application level. The algorithm therefore cannot serve as the basis for an implementation-independent fault-tolerant MPI. In this paper, we present a variant of the Chandy-Lamport algorithm that does not rely on the FIFO property. The variant can be implemented on top of MPI and can hence be used to develop a supplementary software component that adds fault tolerance to any implementation compliant with the MPI standard. We prove the correctness of the algorithm and analyze its performance. Experimental results demonstrating the efficiency of the algorithm are also presented.
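
To make the FIFO assumption concrete, the following is a minimal sketch of why a marker-based snapshot can go wrong on a non-FIFO channel. It is not the variant proposed in the paper, which the abstract does not describe; it only models the classic marker rule for one channel between two processes. The names `Snapshot`, `run_snapshot`, and the message labels `m1` and `m2` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

MARKER = "MARKER"

# Assumed scenario: P0 sends m1, records its own state, sends the marker,
# and then sends m2. Only the delivery order at P1 varies.
SEND_ORDER = ["m1", MARKER, "m2"]


@dataclass
class Snapshot:
    sent_by_p0: int      # messages P0 had sent when it recorded its state
    received_by_p1: int  # messages P1 had received when it recorded its state
    in_channel: int      # messages recorded as in transit on P0 -> P1

    def is_consistent(self) -> bool:
        # Every message sent before the cut must be either received
        # or recorded as in transit, and nothing else may be received.
        return self.sent_by_p0 == self.received_by_p1 + self.in_channel


def run_snapshot(delivery_order):
    """Apply the classic marker rule at P1 for a given delivery order."""
    sent_by_p0 = SEND_ORDER.index(MARKER)  # P0 records state just before the marker
    received = 0
    for item in delivery_order:
        if item == MARKER:
            # First marker on this channel: P1 records its state and
            # records the channel as empty.
            return Snapshot(sent_by_p0, received, 0)
        received += 1
    raise ValueError("marker was never delivered")


if __name__ == "__main__":
    for order in (["m1", MARKER, "m2"],    # FIFO delivery
                  [MARKER, "m1", "m2"],    # marker overtakes m1
                  ["m1", "m2", MARKER]):   # m2 overtakes the marker
        snap = run_snapshot(order)
        print(order, snap, snap.is_consistent())
```

With FIFO delivery the recorded states agree. When the marker overtakes `m1`, the snapshot loses a message that was sent before the cut; when `m2` overtakes the marker, the snapshot contains a receipt with no matching send. Both cases are the kind of inconsistency that a FIFO-independent variant has to rule out.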