In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and it uses communication-induced checkpoint coordination to advance the recovery line, which helps bound rollback propagation during recovery. Thus, it combines the simplicity and low overhead of asynchronous checkpointing with the recovery-time advantages of synchronous checkpointing. Checkpointing involves no extra messages, and the additional checkpointing overhead is nominal. The algorithm ensures that a recovery line consistent with the latest checkpoint of any process exists at all times. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm is free of the domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid the domino effect, it uses selective pessimistic message logging at the receiver end. Recovery is asynchronous for a single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps to determine dependencies between checkpoints, since vector timestamps generally result in high message overhead during failure-free operation.
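The abstract does not spell out the protocol, so the following is only a minimal sketch of one common way to realize communication-induced (index-based) checkpoint coordination, not necessarily the authors' exact algorithm. It assumes each process keeps a single integer sequence number, piggybacks it on every outgoing message (no extra messages, no vector timestamps), and takes a forced checkpoint before delivering a message that carries a larger number. The names `Process`, `basic_checkpoint`, `receive`, and `rollback_target` are hypothetical, and stable storage and the selective pessimistic message logging mentioned above are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Illustrative process state; all names here are assumptions."""
    pid: int
    sn: int = 0  # sequence number of this process's latest checkpoint
    # Indices of checkpoints taken so far; index 0 is the initial state.
    checkpoints: list = field(default_factory=lambda: [0])
    delivered: list = field(default_factory=list)

    def basic_checkpoint(self):
        """Autonomous checkpoint: the process picks the moment itself."""
        self.sn += 1
        self.checkpoints.append(self.sn)

    def send(self, payload):
        """Piggyback one integer on the message; no control messages."""
        return (self.sn, payload)

    def receive(self, message):
        """Take a forced checkpoint first if the sender is ahead of us."""
        msg_sn, payload = message
        if msg_sn > self.sn:
            self.sn = msg_sn
            self.checkpoints.append(self.sn)  # forced, before delivery
        self.delivered.append(payload)


def rollback_target(process, k):
    """Assumed recovery rule: when a failed process restarts from its
    checkpoint with index k, another process rolls back to its earliest
    checkpoint with index >= k, or not at all (None) if it has none."""
    candidates = [c for c in process.checkpoints if c >= k]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    p, q = Process(0), Process(1)
    p.basic_checkpoint()
    p.basic_checkpoint()
    q.receive(p.send("hello"))  # q is behind, so it checkpoints first
    print(q.checkpoints, rollback_target(q, 1))
```

Because the comparison uses only the piggybacked integer and the receiver's own counter, the sketch does not rely on FIFO channels: a late message carrying a smaller number is simply delivered without a forced checkpoint.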