This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a two-level recovery approach that tolerates the more probable failures with a low overhead and the less probable failures with a possibly higher overhead. Such an approach can achieve a smaller overhead than traditional recovery schemes. In this report, we present and evaluate a two-level recovery scheme suitable for a network of workstations in which each workstation has a local disk. The scheme tolerates transient processor failures with a low overhead, while other failures incur a larger overhead. The report analyzes the average (expected) task completion time under the proposed scheme, which has been implemented on a workstation cluster. Our analysis indicates that the proposed two-level scheme can outperform existing "one-level" recovery schemes. The report also evaluates the impact of checkpoint latency on the performance of the recovery scheme; to our knowledge, no such analysis has been carried out previously. Experimental measurements of checkpoint latency for four applications are presented.
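
To make the two-level idea concrete, the following is a minimal simulation sketch and not the scheme analyzed in the report. It assumes a single task split into equal intervals, a cheap local-disk checkpoint after every interval, a more expensive stable-storage checkpoint after every k-th interval, independent exponentially distributed transient and permanent failures, a fixed recovery cost, and no failures during recovery. Checkpoint latency is not modeled. All function and parameter names are hypothetical.

```python
import math
import random


def time_to_failure(rate, rng):
    """Sample an exponential time to the next failure; rate 0 means it never occurs."""
    return rng.expovariate(rate) if rate > 0 else math.inf


def simulate_two_level(n_intervals, interval, k, c_local, c_stable,
                       rate_transient, rate_permanent, recovery, rng):
    """Return the completion time of one simulated run (assumed model, see lead-in)."""
    clock = 0.0
    done = 0          # intervals covered by the latest local checkpoint
    last_stable = 0   # intervals covered by the latest stable-storage checkpoint
    while done < n_intervals:
        # Every k-th checkpoint goes to stable storage; the others stay on local disk.
        is_stable = (done + 1) % k == 0
        segment = interval + (c_stable if is_stable else c_local)
        # Exponential failures are memoryless, so resampling per attempt is valid.
        t_transient = time_to_failure(rate_transient, rng)
        t_permanent = time_to_failure(rate_permanent, rng)
        t_fail = min(t_transient, t_permanent)
        if t_fail >= segment:
            clock += segment
            done += 1
            if is_stable:
                last_stable = done
        else:
            clock += t_fail + recovery
            if t_permanent < t_transient:
                # Local disk contents are assumed lost: roll back to stable storage.
                done = last_stable
            # Transient failure: `done` is unchanged, only the current interval is redone.
    return clock


def mean_completion_time(runs, seed, **params):
    """Average completion time over `runs` independent simulated runs."""
    rng = random.Random(seed)
    return sum(simulate_two_level(rng=rng, **params) for _ in range(runs)) / runs
```

Setting k = 1 sends every checkpoint to stable storage, which gives a one-level baseline under the same assumptions. Comparing `mean_completion_time` for k = 1 and for larger k at chosen failure rates shows how the trade-off between checkpoint cost and rollback distance plays out in this simplified model.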