This paper proposes a new classification of executions with checkpoints, based on the amount of rollback required during recovery. Specifically, an execution is k-rollback if k is the maximal number of checkpoints that have to be rolled back. It is shown that coordinated checkpointing, SZPF, and ZPF are 1-rollback, while ZCF is (n - 1)-rollback, where n is the number of participants in an execution.

A new class of executions, called d-bounded cycles (d-BC for short), is introduced and shown to be ((n - 1) · d)-rollback; ZCF is the special case of d-BC with d = 1.

Finally, a protocol is presented whose executions are d-bounded cycles. The protocol imposes no control information overhead on application messages and sends only a few control messages of its own. Moreover, it maintains information that enables very efficient discovery of a recent recovery line that existed shortly before the failure.
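
To make the stated bounds concrete, the following is a minimal sketch that simply tabulates the k values given in the abstract for each class of executions. The function name `rollback_bound`, the string labels for the classes, and the requirement that n be at least 2 are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the protocol itself.

```python
def rollback_bound(execution_class: str, n: int, d: int = 1) -> int:
    """Return k such that the given class is k-rollback, per the abstract.

    execution_class: hypothetical label, one of
        "coordinated", "SZPF", "ZPF", "ZCF", "d-BC"
    n: number of participants in the execution (assumed >= 2 here)
    d: cycle bound, used only for the "d-BC" class (assumed >= 1)
    """
    if n < 2:
        raise ValueError("n is assumed to be at least 2 in this sketch")
    if d < 1:
        raise ValueError("d is assumed to be at least 1 in this sketch")

    # Coordinated checkpointing, SZPF, and ZPF are stated to be 1-rollback.
    if execution_class in {"coordinated", "SZPF", "ZPF"}:
        return 1
    # ZCF is stated to be (n - 1)-rollback.
    if execution_class == "ZCF":
        return n - 1
    # d-bounded cycles are stated to be ((n - 1) * d)-rollback.
    if execution_class == "d-BC":
        return (n - 1) * d
    raise ValueError(f"unknown execution class: {execution_class!r}")


if __name__ == "__main__":
    n = 8
    for cls in ("coordinated", "SZPF", "ZPF", "ZCF"):
        print(cls, rollback_bound(cls, n))
    for d in (1, 2, 3):
        print(f"d-BC (d={d})", rollback_bound("d-BC", n, d))
```

With d = 1 the d-BC bound coincides with the ZCF bound, which matches the statement that ZCF is the special case of d-BC for d = 1.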