Communication-based prevention of useless checkpoints in distributed computations

Authors:
J.-M. Hélary;A. Mostefaoui;R. H. B. Netzer;M. Raynal
Affiliations:
IRISA, Université de Rennes, Campus de Beaulieu, F-35042 Rennes Cedex, France;IRISA, Université de Rennes, Campus de Beaulieu, F-35042 Rennes Cedex, France;Computer Science Department, Brown University, Box 1910, Providence, RI;IRISA, Université de Rennes, Campus de Beaulieu, F-35042 Rennes Cedex, France
Venue:
Distributed Computing
Year:
2000

Abstract

A useless checkpoint is a local checkpoint that cannot be part of any consistent global checkpoint. This paper addresses the following problem: given a set of processes that take (basic) local checkpoints independently and in an unknown way, design communication-induced checkpointing protocols that direct processes to take additional (forced) local checkpoints so that no local checkpoint is useless. The paper first proves two properties of the integer timestamps associated with each local checkpoint. The first is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second provides an easy timestamp-based determination of consistent global checkpoints. A general communication-induced checkpointing protocol is then proposed. Derived from these two properties, it actually defines a family of timestamp-based communication-induced checkpointing protocols, and several existing checkpointing protocols for the same problem are shown to be particular instances of it. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in "consistent global checkpoint"-based distributed applications, such as the detection of stable or unstable properties and the determination of distributed breakpoints.
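
The abstract does not spell out the two timestamp properties or the general protocol, so the following is only an illustrative sketch of one classic timestamp-based rule of this kind: each process keeps an integer counter, piggybacks it on every message, and takes a forced checkpoint before delivering a message that carries a larger timestamp. The class `Process`, its methods, and the helper `global_checkpoint` are hypothetical names introduced here. The selection rule in the helper (first checkpoint whose timestamp is at least `t`) is an assumption about how timestamps could be used to pick a global checkpoint, not the paper's stated result.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process with an integer checkpoint timestamp (assumed model)."""
    name: str
    clock: int = 0
    checkpoints: list = field(default_factory=list)  # (timestamp, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint stamped 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        # Basic checkpoints are taken independently by the application.
        self.clock += 1
        self.checkpoints.append((self.clock, "basic"))

    def send(self):
        # The current timestamp is piggybacked on the outgoing message.
        return self.clock

    def receive(self, msg_timestamp):
        # Assumed rule: if the message carries a larger timestamp, take a
        # forced checkpoint with that timestamp BEFORE delivering the message.
        if msg_timestamp > self.clock:
            self.clock = msg_timestamp
            self.checkpoints.append((self.clock, "forced"))
        # ... message delivery to the application would happen here ...


def global_checkpoint(processes, t):
    """For each process, pick the first checkpoint with timestamp >= t.

    Returns None for a process that has no such checkpoint yet.
    This selection rule is an assumption made for illustration.
    """
    return {
        p.name: next((ts for ts, _ in p.checkpoints if ts >= t), None)
        for p in processes
    }


# Usage: p takes a basic checkpoint, then sends two messages to q.
p, q = Process("p"), Process("q")
p.basic_checkpoint()
q.receive(p.send())   # q's timestamp is behind, so it takes a forced checkpoint
q.receive(p.send())   # q has caught up, so no extra checkpoint is taken
print(q.checkpoints)
print(global_checkpoint([p, q], 1))
```

In this sketch the forced checkpoint is what keeps a receiver's timestamp from lagging behind a sender's, at the cost of an extra checkpoint whenever a message arrives from a process that is "ahead".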