HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems

Authors:
Yi Luo; D. Manivannan
Venue:
Future Generation Computer Systems
Year:
2012

Abstract

Future-generation supercomputers will be message-passing distributed systems consisting of hundreds of thousands of processors. As the size of a system grows, its failure rate increases. Hence, for such large-scale systems to succeed and be deployable, scalable checkpointing and recovery protocols need to be implemented. Existing checkpointing and rollback-recovery protocols used for providing fault tolerance in distributed systems do not scale to such large systems. In this paper, we address this important and timely issue and propose a scalable, group-based Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging (HOPE) protocol. Performance evaluation indicates that our protocol takes a balanced approach, lowering both checkpointing and message-logging overhead and enhancing scalability.