Quantifying rollback propagation in distributed checkpointing

Authors:
Adnan Agbaria;Hagit Attiya;Roy Friedman;Roman Vitenberg
Affiliations:
Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel;Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel;Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel;Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel
Venue:
Journal of Parallel and Distributed Computing
Year:
2004



Abstract

This paper proposes a new classification of executions with checkpoints, based on the amount of rollback required during recovery. Specifically, an execution is k-rollback if k is the maximal number of checkpoints that have to be rolled back. It is shown that coordinated checkpointing, SZPF, and ZPF are 1-rollback, while ZCF is (n - 1)-rollback, where n is the number of participants in an execution.

A new class of executions, called d-bounded cycles (d-BC for short), is introduced and shown to be ((n - 1) · d)-rollback; ZCF is the special case of d-BC with d = 1.

Finally, a protocol is presented whose executions are d-bounded cycles. The protocol imposes no control-information overhead on application messages and sends only a few control messages of its own. Moreover, it maintains information that enables very efficient discovery of a recent recovery line that existed shortly before the failure.
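
To make the idea of rollback during recovery concrete, the following minimal sketch shows textbook rollback propagation. It is not the protocol or the formal k-rollback definition from the paper. It assumes a simplified model in which interval i of a process is the execution between its checkpoints i and i + 1, every process restarts from its latest checkpoint, and a message is an orphan when its send is undone but its receive is kept. The names `recovery_line` and `rollback_distance` and the message tuple layout are hypothetical, chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical message record: (sender, send_interval, receiver, recv_interval).
# Interval i of a process lies between its checkpoints i and i + 1.
Message = Tuple[str, int, str, int]


def recovery_line(latest: Dict[str, int], messages: List[Message]) -> Dict[str, int]:
    """Return, per process, the checkpoint index to restore.

    Restoring checkpoint c keeps intervals < c and undoes intervals >= c.
    Starting from each process's latest checkpoint, processes are rolled
    back until no orphan message (send undone, receive kept) remains.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            send_undone = send_iv >= line[sender]
            recv_kept = recv_iv < line[receiver]
            if send_undone and recv_kept:
                # Roll the receiver back far enough to undo the receive.
                line[receiver] = recv_iv
                changed = True
    return line


def rollback_distance(latest: Dict[str, int], line: Dict[str, int]) -> Dict[str, int]:
    """Number of checkpoints each process falls behind its latest one."""
    return {p: latest[p] - line[p] for p in latest}


if __name__ == "__main__":
    latest = {"P": 2, "Q": 2}
    # P's send after its checkpoint 2 is lost, which forces Q back to
    # checkpoint 1; that undoes Q's send in interval 1 and forces P back to 0.
    msgs = [("P", 2, "Q", 1), ("Q", 1, "P", 0)]
    line = recovery_line(latest, msgs)
    print(line, rollback_distance(latest, line))
```

In this toy run a single lost message drags both processes back past their latest checkpoints, which is the kind of propagation the k-rollback classification is meant to bound.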