Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
Staggered Consistent Checkpointing
IEEE Transactions on Parallel and Distributed Systems
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification
IEEE Transactions on Parallel and Distributed Systems
Communication-Induced Determination of Consistent Snapshots
IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Finding Consistent Global Checkpoints in a Distributed Computation
IEEE Transactions on Parallel and Distributed Systems
Asynchronous recovery without using vector timestamps
Journal of Parallel and Distributed Computing
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
On the Minimal Characterization of the Rollback-Dependency Trackability Property
ICDCS '01 Proceedings of the The 21st International Conference on Distributed Computing Systems
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Modeling Coordinated Checkpointing for Large-Scale Supercomputers
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages
IEEE Transactions on Dependable and Secure Computing
On the Fully-Informed Communication-Induced Checkpointing Protocol
PRDC '05 Proceedings of the 11th Pacific Rim International Symposium on Dependable Computing
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage
Information Sciences: an International Journal
Hi-index | 0.00 |
Checkpointing and rollback recovery are well-established techniques for dealing with failures in distributed systems. In this paper, we briefly summarize the existing solution approaches to these problems and also discuss the open issues, suggested approaches and some preliminary work that we have done addressing the open issues.