Model-based performance evaluation of distributed checkpointing protocols

Authors:
Adnan Agbaria;Roy Friedman
Affiliations:
IBM Haifa Research Lab, Mount Carmel, Haifa 31905, Israel;Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel
Venue:
Performance Evaluation
Year:
2008

Citing 32
Cited 3

Efficient checkpointing on MIMD architectures

Efficient checkpointing on MIMD architectures
Necessary and Sufficient Conditions for Consistent Global Snapshots

IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
A case for two-level distributed recovery schemes

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints

IEEE Transactions on Computers
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

IEEE Transactions on Computers
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM
A first order approximation to the optimum checkpoint interval

Communications of the ACM
Probability and Statistics with Reliability, Queuing and Computer Science Applications

Probability and Statistics with Reliability, Queuing and Computer Science Applications
Processor allocation and checkpoint interval selection in cluster computing systems

Journal of Parallel and Distributed Computing - Special issue on cluster and network-based computing
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Virtual-machine-based heterogeneous checkpointing

Software—Practice & Experience
Design, Implementation, and Performance of Checkpointing in NetSolve

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability

FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
An Object-Oriented Testbed for the Evaluation of Checkpointing and Recovery Systems

FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems

FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols

FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Communication-Induced Determination of Consistent Snapshots

FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
An Analysis of Communication-Induced Checkpointing

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations

HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Maximum and minimum consistent global checkpoints and their applications

SRDS '95 Proceedings of the 14TH Symposium on Reliable Distributed Systems
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints

SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
A low-overhead recovery technique using quasi-synchronous checkpointing

ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Evaluating Distributed Checkpointing Protocol

ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
The Möbius Modeling Tool

PNPM '01 Proceedings of the 9th international Workshop on Petri Nets and Performance Models (PNPM'01)
Another Two-Level Failure Recovery Scheme

Another Two-Level Failure Recovery Scheme
Consistent Logical Checkpointing

Consistent Logical Checkpointing
An Overview of Checkpointing in Uniprocessor and DistributedSystems, Focusing on Implementation and Performance

An Overview of Checkpointing in Uniprocessor and DistributedSystems, Focusing on Implementation and Performance
Quantifying rollback propagation in distributed checkpointing

Journal of Parallel and Distributed Computing
Libckpt: transparent checkpointing under Unix

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings

Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system

Information Sciences: an International Journal
Failure-aware resource management for high-availability computing clusters with distributed virtual machines

Journal of Parallel and Distributed Computing
Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E families

Performance Evaluation

Quantified Score

Hi-index	0.00

Visualization

Abstract

A large number of distributed checkpointing protocols have appeared in the literature. However, to make informed decisions about which protocol performs best for a given environment, one must use an objective measure for comparing them. Obviously, a distributed checkpointing protocol could be the best in a specific environment, but not in another environment. This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. Using the objective measure as an evaluation technique, the paper also analyses several known protocols and compares their overhead ratios.