Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

Authors:
D. Manivannan;Mukesh Singhal
Affiliations:
Univ. of Kentucky, Lexington;The Ohio State Univ., Columbus
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999

Abstract

Checkpointing algorithms are classified in the literature as synchronous or asynchronous. In synchronous checkpointing, processes coordinate their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. This coordination incurs message overhead, and process execution may have to be suspended while it takes place, which degrades performance. In asynchronous checkpointing, processes take checkpoints without coordinating with one another. This gives processes maximum autonomy in deciding when to checkpoint; however, some of the checkpoints taken may not lie on any consistent global checkpoint, which makes the effort spent taking them useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication-induced checkpoints in addition to asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms in each class. It also provides guidelines for designing and evaluating such algorithms.
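
To make the idea of a communication-induced checkpoint concrete, the following is a minimal sketch under stated assumptions, not the framework or any specific algorithm from the paper. It assumes a simple index-based rule: each process keeps a checkpoint index, piggybacks it on every message, and takes a forced checkpoint before delivering a message that carries a larger index. The names `Process`, `Message`, `sn`, `basic_checkpoint`, and `receive` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint index, piggybacked on the message (assumed rule)


@dataclass
class Process:
    name: str
    sn: int = 0  # index of the most recent local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs
    delivered: list = field(default_factory=list)

    def __post_init__(self):
        # Every process starts with an initial checkpoint at index 0.
        self._save("initial")

    def _save(self, kind):
        # Stand-in for writing process state to stable storage.
        self.checkpoints.append((self.sn, kind))

    def basic_checkpoint(self):
        # Asynchronous checkpoint taken on the process's own schedule.
        self.sn += 1
        self._save("basic")

    def send(self, payload):
        return Message(payload, self.sn)

    def receive(self, msg):
        # Communication-induced (forced) checkpoint: if the sender is ahead,
        # checkpoint BEFORE delivering, so the message is not received
        # "before" a checkpoint whose index says it was sent "after".
        if msg.sn > self.sn:
            self.sn = msg.sn
            self._save("forced")
        self.delivered.append(msg.payload)


p, q = Process("P"), Process("Q")
p.basic_checkpoint()       # P moves to index 1 on its own
q.receive(p.send("m1"))    # Q is behind, so it is forced to checkpoint first
p.receive(q.send("m2"))    # indices match, so no forced checkpoint is taken
print(p.checkpoints)       # [(0, 'initial'), (1, 'basic')]
print(q.checkpoints)       # [(0, 'initial'), (1, 'forced')]
```

In this sketch Q never chose to checkpoint, yet it ends up with a checkpoint at index 1 that pairs with P's. That is the sense in which processes keep their autonomy while communication alone induces the extra checkpoints needed to avoid useless ones.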