Checkpointing algorithms are classified in the literature as synchronous or asynchronous. In synchronous checkpointing, processes coordinate their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. This coordination incurs message overhead, and process execution may have to be suspended while it takes place, which degrades performance. In asynchronous checkpointing, processes take checkpoints without coordinating with one another. This gives processes maximum autonomy in deciding when to checkpoint; however, some of the checkpoints taken may not lie on any consistent global checkpoint, which makes the effort spent on them useless. Some asynchronous checkpointing algorithms in the literature reduce the number of useless checkpoints by making processes take communication-induced checkpoints in addition to asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory also helps to analyze the properties and limitations of the algorithms belonging to each class, and it provides guidelines for designing and evaluating such algorithms.
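To make the idea of a communication-induced checkpoint concrete, the following minimal sketch shows one common style of rule: each process numbers its checkpoints, piggybacks the current number on every outgoing message, and takes a forced checkpoint before delivering a message whose number is ahead of its own. This is an illustration only and is not taken from the framework presented in the paper; the class `Process`, its methods, and the sequence-number rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process taking basic and communication-induced checkpoints."""
    pid: int
    sn: int = 0  # sequence number of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (sn, kind) pairs

    def __post_init__(self):
        # Assumption: every process starts with an initial checkpoint numbered 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        """Checkpoint taken autonomously, with no coordination."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        """Return the control information piggybacked on an outgoing message."""
        return self.sn

    def receive(self, piggybacked_sn):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if piggybacked_sn > self.sn:
            self.sn = piggybacked_sn
            self.checkpoints.append((self.sn, "forced"))
        # The application message would be delivered here.


p, q = Process(0), Process(1)
p.basic_checkpoint()   # p advances to sequence number 1 on its own
q.receive(p.send())    # q is behind, so it takes a forced checkpoint numbered 1
q.receive(p.send())    # q has caught up, so no checkpoint is taken
print(q.checkpoints)   # [(0, 'initial'), (1, 'forced')]
```

The only coordination here is the integer carried on application messages: no extra control messages are exchanged and no process is suspended, which is the trade-off between the synchronous and asynchronous extremes that the abstract describes.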