A Case for Two-Level Recovery Schemes

Authors:
Nitin H. Vaidya
Affiliations:
Texas A&M Univ., College Station
Venue:
IEEE Transactions on Computers
Year:
1998


Quantified Score

Hi-index	14.98


Abstract

Long-running applications are often subject to failures, and a failure can destroy a significant amount of completed computation. A failure recovery scheme is therefore needed to keep performance overhead low in the presence of failures. In this paper, we argue that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may incur a higher overhead. By minimizing overhead for the more frequently occurring failure scenarios, the two-level approach can achieve lower average performance overhead than existing recovery schemes.

The paper describes two two-level recovery schemes. Performance analysis using a Markov chain shows that, in practice, a two-level scheme can perform better than its "one-level" counterpart. Although the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance and achieve better performance than existing recovery schemes. The paper presents an analytical approach for evaluating the performance of two-level schemes and shows that such schemes are hard to optimize analytically.
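The trade-off described in the abstract can be illustrated with a rough first-order cost estimate. The sketch below is not the paper's Markov chain analysis. It assumes a simplified model in which cheap checkpoints protect only against the frequent failure type, every n-th checkpoint is a full one that protects against all failures, and a failure loses on average half of the work since the last usable checkpoint. All function names, parameters, and numbers are hypothetical and chosen only for illustration.

```python
# Illustrative first-order model (an assumption, not the paper's analysis).
# Overhead is expressed as extra time per unit of useful work.

def one_level_overhead(interval, c_full, rate_frequent, rate_rare):
    """Every checkpoint is a full checkpoint that tolerates all failures."""
    checkpoint_cost = c_full / interval
    # Any failure rolls back to the latest checkpoint: about interval/2 lost.
    rework = (rate_frequent + rate_rare) * interval / 2
    return checkpoint_cost + rework


def two_level_overhead(interval, n, c_cheap, c_full, rate_frequent, rate_rare):
    """n - 1 cheap checkpoints followed by one full checkpoint, repeated."""
    checkpoint_cost = ((n - 1) * c_cheap + c_full) / (n * interval)
    # Frequent failures roll back to the latest checkpoint of either kind.
    rework_frequent = rate_frequent * interval / 2
    # Rare failures roll back to the latest full checkpoint, n intervals apart.
    rework_rare = rate_rare * n * interval / 2
    return checkpoint_cost + rework_frequent + rework_rare


# Hypothetical parameters: frequent failures are 100x more likely than rare
# ones, and a cheap checkpoint costs a tenth of a full one.
params = dict(interval=100.0, c_full=10.0, rate_frequent=1e-4, rate_rare=1e-6)
one = one_level_overhead(**params)
two = two_level_overhead(n=10, c_cheap=1.0, **params)
print(f"one-level overhead: {one:.5f}")
print(f"two-level overhead: {two:.5f}")
```

Under these assumed parameters the two-level estimate is lower because most checkpoints are cheap, while the extra rework caused by rare failures is weighted by their small rate. With n = 1 the two-level formula reduces to the one-level one, which is a quick sanity check on the model.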