Long-running applications are often subject to failures, and a failure can cause significant loss of computation. A failure recovery scheme is therefore needed to minimize performance overhead in the presence of failures. In this paper, we argue that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may incur a higher overhead. By minimizing overhead for the more frequently occurring failure scenarios, the two-level approach can achieve lower average performance overhead than existing recovery schemes.

The paper describes two two-level recovery schemes. Performance analysis using a Markov chain shows that, in practice, a two-level scheme can perform better than its "one-level" counterpart. While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance and achieve better performance than existing recovery schemes. The paper presents an analytical approach for evaluating the performance of two-level schemes and shows that such schemes are hard to optimize analytically.