A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Authors:
D. Manivannan;Q. Jiang;Jianchang Yang;M. Singhal
Affiliations:
Department of Computer Science, University of Kentucky, Lexington, KY 40506, United States;Department of Computer Science, University of Kentucky, Lexington, KY 40506, United States;Department of Computer and Information Sciences, SUNY, Fredonia, Fredonia, NY 14063, United States;Department of Computer Science, University of Kentucky, Lexington, KY 40506, United States
Venue:
Information Sciences: an International Journal
Year:
2008

Abstract

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm that reduces contention for network stable storage without any synchronization overhead.
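
The contention effect described in the abstract can be illustrated with a toy model. The sketch below is not the algorithm proposed in the paper. It assumes a single stable-storage server that handles one checkpoint write at a time in FIFO order, with a fixed write time per checkpoint. The function name `checkpoint_latencies` and all parameter values are hypothetical, chosen only for illustration.

```python
def checkpoint_latencies(start_times, write_time):
    """Toy model: one storage server, one write at a time, FIFO order.

    Returns, for each process (ordered by start time), the time from the
    moment it begins checkpointing until its write to storage completes.
    """
    storage_free_at = 0.0
    latencies = []
    for start in sorted(start_times):
        # A write cannot begin until the server has finished the previous one.
        begin = max(start, storage_free_at)
        storage_free_at = begin + write_time
        latencies.append(storage_free_at - start)
    return latencies


n, write_time = 4, 2.0  # assumed values, for illustration only

# All processes start their checkpoint at the same instant.
simultaneous = checkpoint_latencies([0.0] * n, write_time)

# Each process starts one write time after the previous one.
staggered = checkpoint_latencies([i * write_time for i in range(n)], write_time)

print(simultaneous)
print(staggered)
```

In this model, processes that checkpoint at the same instant queue behind one another, so latency grows with the number of processes. When the start times are spaced at least one write time apart, no process waits for the server.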