Communication-Induced Determination of Consistent Snapshots

Authors:
Jean-Michel Hélary;Achour Mostefaoui;Michel Raynal
Affiliations:
IRISA, Rennes, France;IRISA, Rennes, France;IRISA, Rennes, France
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999

Abstract

A classical way to determine consistent snapshots is to use the Chandy-Lamport algorithm. This algorithm relies on dedicated control messages that allow processes to synchronize local checkpoint determination and message recording so that the resulting snapshot is consistent. This paper investigates a communication-induced approach to determining consistent snapshots. In such an approach, control information is carried by application messages rather than by separate control messages. Two abstract necessary and sufficient conditions are stated: one associated with global checkpoint consistency, the other associated with message recording. A general protocol is derived from these abstract conditions. This general protocol can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. It also shows that there is an intrinsic trade-off between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general protocol is provided.
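
To make the idea of control information carried by application messages concrete, here is a minimal sketch in Python. It is not the protocol of the paper, whose conditions and instantiation are not given in this abstract. It assumes instead a classical index-based scheme: each process keeps a checkpoint index, stamps it on every outgoing message, takes a forced checkpoint before delivering a message that carries a larger index, and records a message that carries a smaller index as in transit. The names Message, Process, sn, and recorded are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint index at send time (piggybacked control information)


@dataclass
class Process:
    name: str
    sn: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs
    recorded: dict = field(default_factory=dict)     # snapshot index -> in-transit payloads
    delivered: list = field(default_factory=list)

    def __post_init__(self):
        # Assumption: every process starts with an initial checkpoint of index 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        """Checkpoint taken on the process's own initiative."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload):
        """No control message is sent: the index travels with the payload."""
        return Message(payload, self.sn)

    def receive(self, m):
        if m.sn > self.sn:
            # Forced checkpoint BEFORE delivery, so that the message is not
            # received "inside" a snapshot it was sent "after" (no orphan).
            self.sn = m.sn
            self.checkpoints.append((self.sn, "forced"))
        elif m.sn < self.sn:
            # Sent before the sender's checkpoint k and received after the
            # receiver's checkpoint k, for every k in (m.sn, self.sn]:
            # the message is in transit for those snapshots and is recorded.
            for k in range(m.sn + 1, self.sn + 1):
                self.recorded.setdefault(k, []).append(m.payload)
        self.delivered.append(m.payload)


# Usage: m1 is sent before p's checkpoint 1, m2 after it; m2 overtakes m1.
p, q = Process("p"), Process("q")
m1 = p.send("a")
p.basic_checkpoint()
m2 = p.send("b")
q.receive(m2)  # q takes a forced checkpoint with index 1, then delivers "b"
q.receive(m1)  # "a" is recorded as in transit for snapshot 1
print(q.checkpoints, q.recorded)
```

In this sketch, every basic checkpoint of one process can trigger forced checkpoints at the processes it communicates with, and every message that crosses a checkpoint index must be recorded. That gives a rough feel for the two costs the abstract refers to, forced checkpoints and recorded messages, although the precise trade-off is the one established in the paper itself.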