Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems.

Authors:
Yi-Min Wang;Pi-Yu Chung;In-Jen Lin;W. Kent Fuchs
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1995

Abstract

Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but it suffers from potential domino effects and the associated space overhead. Prior to this research, checkpoint space reclamation was based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints might have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. Using recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by $N(N+1)/2$ checkpoints, where $N$ is the number of processes.

Index Terms: Fault tolerance, message-passing systems, uncoordinated checkpointing, rollback recovery, garbage collection.
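To give a feel for the bound quoted in the abstract, the following minimal sketch evaluates $N(N+1)/2$ for a few process counts. The function name `checkpoint_space_bound` is a hypothetical helper introduced here for illustration only. It computes the stated formula and does not implement the paper's reclamation algorithm.

```python
def checkpoint_space_bound(num_processes: int) -> int:
    """Evaluate N(N+1)/2, the bound on retained checkpoints stated in the abstract.

    Hypothetical helper: this only computes the formula, not the
    reclamation algorithm that achieves the bound.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be a positive integer")
    # N(N+1) is always even, so integer division is exact.
    return num_processes * (num_processes + 1) // 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"N={n:2d}: at most {checkpoint_space_bound(n)} checkpoints retained")
```

For example, with four processes the bound is 10 checkpoints in total. The bound grows quadratically in $N$ but does not depend on how long the computation runs, in contrast to the potentially unbounded retention the abstract attributes to obsolete-checkpoint-based reclamation.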