Future generation supercomputers will be message-passing distributed systems consisting of hundreds of thousands of processors. As the size of the system grows, the failure rate increases. Hence, for such large-scale systems to succeed and be deployable, scalable checkpointing and recovery protocols must be implemented. Existing checkpointing and rollback-recovery protocols used to provide fault tolerance in distributed systems do not scale to such large systems. In this paper, we address this important and timely issue and propose a scalable group-based Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging (HOPE) protocol. Performance evaluation indicates that our protocol takes a balanced approach to lower checkpointing and message-logging overhead and enhances scalability.