Many distributed checkpointing protocols have appeared in the literature. Choosing the protocol that performs best in a given environment, however, requires an objective measure for comparing them, since a protocol that is best in one environment may not be best in another. This paper presents such a measure, called the overhead ratio, for evaluating distributed checkpointing protocols. The measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, it takes into account the rollback propagation of the protocol, which affects the length of the recovery process and therefore the expected program run-time in executions that involve failures and recoveries. Using the overhead ratio as an evaluation technique, the paper also analyses several known protocols and compares their overhead ratios.
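To make the idea of an overhead ratio concrete, the following is a minimal sketch and not the paper's actual model. It assumes a commonly used definition (expected run-time with checkpoints and failures, divided by failure-free run-time without checkpoints, minus one), exponentially distributed failures, failure-free recovery, and a restart from the beginning of the current checkpoint interval. The parameter names, including `rollback_extra` as a stand-in for the added recovery time caused by rollback propagation, are hypothetical.

```python
import math


def expected_interval_time(work, ckpt_cost, recovery, failure_rate):
    """Expected wall-clock time to complete one checkpoint interval.

    Assumptions (illustrative, not taken from the paper): failures arrive
    as a Poisson process with rate `failure_rate`, a failure discards all
    progress in the current interval, each failure costs `recovery` time
    units, and no failure occurs during recovery itself.
    """
    segment = work + ckpt_cost
    # Closed form for a segment that is restarted after every failure.
    return (1.0 / failure_rate + recovery) * math.expm1(failure_rate * segment)


def overhead_ratio(work, ckpt_cost, restart, rollback_extra, failure_rate):
    """Relative increase in expected run-time per unit of useful work.

    `rollback_extra` is a hypothetical parameter: the additional expected
    recovery time a protocol incurs because of rollback propagation.
    """
    recovery = restart + rollback_extra
    expected = expected_interval_time(work, ckpt_cost, recovery, failure_rate)
    return expected / work - 1.0


if __name__ == "__main__":
    # Two made-up protocols: B has cheaper checkpoints but rolls back further.
    a = overhead_ratio(work=100.0, ckpt_cost=5.0, restart=2.0,
                       rollback_extra=0.0, failure_rate=1e-3)
    b = overhead_ratio(work=100.0, ckpt_cost=2.0, restart=2.0,
                       rollback_extra=40.0, failure_rate=1e-3)
    print(f"protocol A overhead ratio: {a:.4f}")
    print(f"protocol B overhead ratio: {b:.4f}")
```

Under these assumptions, a protocol with cheap checkpoints can still show a higher overhead ratio if its rollback propagation lengthens recovery, which is the kind of trade-off the measure is meant to expose.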