Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Information Processing Letters
Detection of stable properties in distributed applications
PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Logical Time in Distributed Computing Systems
Computer - Distributed computing systems: separate resources acting as one
Consistent detection of global predicates
PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Journal of Systems and Software - Special issue on software engineering for distributed computing
Adaptive recovery for mobile environments
Communications of the ACM
Detection of Strong Unstable Predicates in Distributed Programs
IEEE Transactions on Parallel and Distributed Systems
Distributed breakpoint detection in message-passing programs
Journal of Parallel and Distributed Computing
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
IEEE Transactions on Parallel and Distributed Systems
Rollback-dependency trackability: visible characterizations
Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing
Evaluations of domino-free communication-induced checkpointing protocols
Information Processing Letters
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Checkpointing distributed applications on mobile computers
PDIS '94 Proceedings of the third international conference on on Parallel and distributed information systems
Communication-Induced Determination of Consistent Snapshots
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
On-the-Fly Detection of Conjunctions of Local Predicates in Distributed Computations
SPDP '96 Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP '96)
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Tracking immediate predecessors in distributed computations
Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
A Roll-Forward Recovery Scheme for Solving the Problem of Coasting Forward for Distributed Systems
ACM SIGOPS Operating Systems Review
Interval consistency of asynchronous distributed computations
Journal of Computer and System Sciences
On Properties of RDT Communication-Induced Checkpointing Protocols
IEEE Transactions on Parallel and Distributed Systems
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
A New Approach for High Performance Computing Systems with Various Checkpointing Schemes
The Journal of Supercomputing
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages
IEEE Transactions on Dependable and Secure Computing
On the Complexity of Removing Z-Cycles from a Checkpoints and Communication Pattern
IEEE Transactions on Computers
An enhanced model-based checkpointing protocol
PDCN'07 Proceedings of the 25th conference on Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks
Journal of Parallel and Distributed Computing
Information Sciences: an International Journal
Implementing rollback-recovery coordinated checkpoints
ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Future Generation Computer Systems
A multi-cycle checkpointing protocol that ensures strict 1-rollback
Information Processing Letters
Hi-index | 0.00 |
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in "consistent global checkpoint"-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints.