This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratio.