Space reclamation for uncoordinated checkpointing in message-passing systems
Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Prior to this research, checkpoint space reclamation was based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints might have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. Using recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by $N(N+1)/2$ checkpoints, where $N$ is the number of processes.

Index Terms: Fault tolerance, message-passing systems, uncoordinated checkpointing, rollback recovery, garbage collection.
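
To give a feel for the stated bound, the following minimal sketch evaluates $N(N+1)/2$ for a few system sizes. It only illustrates the arithmetic of the bound; it is not the reclamation algorithm itself, and the function name `max_retained_checkpoints` is a hypothetical helper introduced here for illustration.

```python
def max_retained_checkpoints(n: int) -> int:
    """Evaluate the bound N(N+1)/2 on retained checkpoints for n processes.

    Hypothetical helper: it only computes the formula quoted in the abstract.
    """
    if n < 1:
        raise ValueError("the number of processes must be at least 1")
    # n * (n + 1) is always even, so integer division is exact.
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"N = {n:2d}: at most {max_retained_checkpoints(n)} checkpoints")
```

For example, a system of 4 processes would need to keep at most 10 checkpoints in total on stable storage under this bound, and a system of 16 processes at most 136. The bound grows quadratically in $N$ but does not depend on how long the computation runs.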