Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
On selecting rollback points for error recovery
Information Sciences: an International Journal
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Staggered Consistent Checkpointing
IEEE Transactions on Parallel and Distributed Systems
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification
IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Asynchronous recovery without using vector timestamps
Journal of Parallel and Distributed Computing
Observing Global States of Asynchronous Distributed Applications
Proceedings of the 3rd International Workshop on Distributed Algorithms
Distributed Checkpointing on Clusters with Dynamic Striping and Staggering
ASIAN '02 Proceedings of the7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability
FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Performance analysis of different checkpointing and recovery schemes using stochastic model
Journal of Parallel and Distributed Computing
Concurrency and Computation: Practice & Experience
Self-stabilizing algorithm for checkpointing in a distributed system
Journal of Parallel and Distributed Computing
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartII
An asynchronous recovery algorithm based on a staggered quasi-synchronous checkpointing algorithm
IWDC'05 Proceedings of the 7th international conference on Distributed Computing
ICS'08 Proceedings of the 12th WSEAS international conference on Systems
AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds
Journal of Parallel and Distributed Computing
Hi-index | 0.07 |
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance as processes may have to wait for long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.