A low-overhead recovery technique using quasi-synchronous checkpointing

Venue: ICDCS '96, Proceedings of the 16th International Conference on Distributed Computing Systems
Year: 1996

Cited 37

An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
SFT: a consistent checkpointing algorithm with shorter freezing time

ACM SIGOPS Operating Systems Review
SCR algorithm: saving/restoring states of file systems

ACM SIGOPS Operating Systems Review
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

IEEE Transactions on Parallel and Distributed Systems
Communication-Induced Determination of Consistent Snapshots

IEEE Transactions on Parallel and Distributed Systems
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

The Journal of Supercomputing
Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Asynchronous recovery without using vector timestamps

Journal of Parallel and Distributed Computing
Protocol for Taking Object-Based Checkpoints

DEXA '00 Proceedings of the 11th International Conference on Database and Expert Systems Applications
Deadlocks in fully uncoordinated checkpointing rollback recovery systems

WORDS '97 Proceedings of the 3rd Workshop on Object-Oriented Real-Time Dependable Systems - (WORDS '97)
Object-Based Checkpoints in Distributed Systems

WORDS '97 Proceedings of the 3rd Workshop on Object-Oriented Real-Time Dependable Systems - (WORDS '97)
Checkpoint and Rollback in Asynchronous Distributed Systems

INFOCOM '97 Proceedings of the INFOCOM '97. Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Driving the Information Revolution
On designing direct dependency: based fast recovery algorithms for distributed systems

ACM SIGOPS Operating Systems Review
Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

Journal of Parallel and Distributed Computing
Communication-based prevention of useless checkpoints in distributed computations

Distributed Computing
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages

IEEE Transactions on Dependable and Secure Computing
Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Quasi-atomic recovery for distributed agents

Parallel Computing
Promised messages: recovering from inconsistent global states

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Model-based performance evaluation of distributed checkpointing protocols

Performance Evaluation
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Information Sciences: an International Journal
Lightweight log management algorithm for removing logged messages of sender processes with little overhead

WSEAS Transactions on Computers
An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Journal of Parallel and Distributed Computing
FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems

Journal of Parallel and Distributed Computing
Checkpointing and rollback recovery in distributed systems: existing solutions, open issues and proposed solutions

ICS'08 Proceedings of the 12th WSEAS international conference on Systems
A weighted checkpointing protocol for mobile distributed systems

International Journal of Ad Hoc and Ubiquitous Computing
ROS: the rollback-one-step method to minimize the waiting time during debugging long-running parallel programs

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Dodging the cost of unavoidable memory copies in message logging protocols

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E families

Performance Evaluation
An efficient and scalable checkpointing and recovery algorithm for distributed systems

ICDCN'06 Proceedings of the 8th international conference on Distributed Computing and Networking
An asynchronous recovery algorithm based on a staggered quasi-synchronous checkpointing algorithm

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Garbage collection in a causal message logging protocol

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
A communication-induced checkpointing and asynchronous recovery algorithm for multithreaded distributed systems

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
A hybrid message Logging-CIC protocol for constrained checkpointability

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A multi-cycle checkpointing protocol that ensures strict 1-rollback

Information Processing Letters
Reversible simulations of elastic collisions

ACM Transactions on Modeling and Computer Simulation (TOMACS)

Abstract

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and it uses communication-induced checkpoint coordination to advance the recovery line, which helps bound rollback propagation during recovery. Thus, it combines the simplicity and low overhead of asynchronous checkpointing with the recovery-time advantages of synchronous checkpointing. Checkpointing involves no extra messages, and the additional checkpointing overhead is nominal. The algorithm ensures that a recovery line consistent with the latest checkpoint of any process exists at all times. The recovery algorithm exploits this property to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid the domino effect, it uses selective pessimistic message logging at the receiver end. Recovery is asynchronous for a single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps to determine dependencies between checkpoints, since vector timestamps generally result in high message overhead during failure-free operation.
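
The abstract does not spell out the checkpointing rule, so the following is only a minimal sketch of a generic index-based, communication-induced scheme of the kind it describes, not the paper's actual algorithm. It assumes each checkpoint carries a sequence number, that the number is piggybacked on ordinary application messages (so no extra messages are sent), and that a receiver takes a forced checkpoint before delivering a message whose number is ahead of its own. The names `Process`, `basic_checkpoint`, `send`, `receive`, and `first_checkpoint_at_or_after` are hypothetical, and message logging and recovery are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that tags each checkpoint with a sequence number."""
    pid: int
    sn: int = 0  # sequence number of the latest checkpoint
    checkpoints: list = field(default_factory=list)  # (sn, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint numbered 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        """Checkpoint taken autonomously, without coordinating with anyone."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        """Piggyback the current sequence number on an application message."""
        return {"src": self.pid, "sn": self.sn}

    def receive(self, msg):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if msg["sn"] > self.sn:
            self.sn = msg["sn"]
            self.checkpoints.append((self.sn, "forced"))
        # The message would be handed to the application at this point.

    def first_checkpoint_at_or_after(self, n):
        """Earliest local checkpoint with sequence number >= n, or None."""
        for sn, kind in self.checkpoints:
            if sn >= n:
                return (sn, kind)
        return None


if __name__ == "__main__":
    p0, p1 = Process(0), Process(1)
    p0.basic_checkpoint()   # p0 checkpoints on its own schedule
    p1.receive(p0.send())   # p1 is behind, so it checkpoints before delivery
    print(p1.checkpoints)
```

Under these assumptions, a message sent after a checkpoint numbered n is never delivered before the receiver's first checkpoint numbered n or higher, which is what lets checkpoints be grouped into a consistent recovery line by comparing plain integers rather than vector timestamps.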