A case for two-level distributed recovery schemes

Authors:
Nitin H. Vaidya
Affiliations:
Department of Computer Science, Texas A&M University, College Station, TX
Venue:
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Year:
1995

Abstract

Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures. In this paper, we demonstrate that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve lower performance overhead (on average) than existing recovery schemes.

To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while multiple failure recovery is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that to minimize the average overhead, it is often necessary to take both 1-checkpoints and N-checkpoints.

While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance.
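
The rollback rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's model or implementation. The names `Checkpoint`, `rollback_target`, and `lost_work`, the checkpoint times, and the assumption that an N-checkpoint can also be used to recover from a single failure are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    time: float  # time at which the checkpoint was taken (arbitrary units)
    level: str   # "1" for a 1-checkpoint, "N" for an N-checkpoint


def rollback_target(checkpoints: List[Checkpoint],
                    failed_count: int) -> Optional[Checkpoint]:
    """Return the most recent checkpoint usable for the given failure count.

    Assumption (not stated in the abstract): an N-checkpoint can also be
    used for a single failure, so a single failure rolls back to the most
    recent checkpoint of either type.
    """
    usable = [c for c in checkpoints
              if c.level == "N" or failed_count <= 1]
    if not usable:
        return None
    return max(usable, key=lambda c: c.time)


def lost_work(checkpoints: List[Checkpoint],
              failure_time: float,
              failed_count: int) -> Optional[float]:
    """Computation lost by rolling back, or None if no checkpoint is usable."""
    # Only checkpoints taken before the failure are candidates.
    earlier = [c for c in checkpoints if c.time <= failure_time]
    target = rollback_target(earlier, failed_count)
    if target is None:
        return None
    return failure_time - target.time


# Hypothetical schedule: N-checkpoints are taken rarely, 1-checkpoints often.
schedule = [
    Checkpoint(0.0, "N"),
    Checkpoint(10.0, "1"),
    Checkpoint(20.0, "1"),
    Checkpoint(30.0, "N"),
    Checkpoint(40.0, "1"),
]

print(lost_work(schedule, 45.0, failed_count=1))  # 5.0
print(lost_work(schedule, 45.0, failed_count=2))  # 15.0
```

In this toy schedule the single failure loses less work because a recent 1-checkpoint is available, while the multiple failure falls back to the older N-checkpoint. The sketch ignores the cost of taking each checkpoint type, which the abstract treats as part of the average overhead.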