Evaluating Distributed Checkpointing Protocols

Authors:
Adnan Agbaria;Ari Freund;Roy Friedman
Affiliations:
-;-;-
Venue:
ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
Year:
2003

Citing 14
Cited 8

Efficient checkpointing on MIMD architectures

Efficient checkpointing on MIMD architectures
Necessary and Sufficient Conditions for Consistent Global Snapshots

IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
A case for two-level distributed recovery schemes

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints

IEEE Transactions on Computers
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM
Probability and Statistics with Reliability, Queuing and Computer Science Applications

Probability and Statistics with Reliability, Queuing and Computer Science Applications
Processor allocation and checkpoint interval selection in cluster computing systems

Journal of Parallel and Distributed Computing - Special issue on cluster and network-based computing
An Analysis of Communication-Induced Checkpointing

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations

HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints

SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
Virtual Machine Based Heterogeneous Checkpointing

IPDPS '02 Proceedings of the 16th International Symposium on Parallel and Distributed Processing
Another Two-Level Failure Recovery Scheme

Another Two-Level Failure Recovery Scheme

Quantifying rollback propagation in distributed checkpointing

Journal of Parallel and Distributed Computing
A New Approach for High Performance Computing Systems with Various Checkpointing Schemes

The Journal of Supercomputing
In-network fault tolerance in networked sensor systems

DIWANS '06 Proceedings of the 2006 workshop on Dependability issues in wireless ad hoc networks and sensor networks
A Parsimonious Approach for Obtaining Resource-Efficient and Trustworthy Execution

IEEE Transactions on Dependable and Secure Computing
Model-based performance evaluation of distributed checkpointing protocols

Performance Evaluation
A novel fault-tolerant execution model by using of mobile agents

Journal of Network and Computer Applications
Numerical computation algorithms for sequential checkpoint placement

Performance Evaluation
Application-Level checkpointing techniques for parallel programs

ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology

Abstract

This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratio.