A Case for Two-Level Recovery Schemes
IEEE Transactions on Computers
Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures. In this paper, we demonstrate that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve a lower average performance overhead than existing recovery schemes.

To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while recovery from multiple failures is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that minimizing the average overhead often requires taking both 1-checkpoints and N-checkpoints.

While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance.
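The rollback rule described in the abstract can be illustrated with a small sketch. This is a toy illustration only, not the paper's implementation or its performance model: the `Checkpoint` class, the `rollback_target` function, and the assumption that an N-checkpoint can also be used to recover from a single failure are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    time: float  # when the checkpoint was taken (hypothetical time units)
    level: str   # "1" for a 1-checkpoint, "N" for an N-checkpoint


def rollback_target(checkpoints, failed_processors):
    """Return the most recent checkpoint usable for the given failure.

    Assumption (not stated in the abstract): an N-checkpoint may also be
    used to recover from a single failure, so a single failure rolls back
    to the latest checkpoint of either type.
    """
    if failed_processors < 1:
        raise ValueError("at least one processor must have failed")
    # A single failure can use either type; multiple failures need an N-checkpoint.
    usable = {"1", "N"} if failed_processors == 1 else {"N"}
    candidates = [c for c in checkpoints if c.level in usable]
    if not candidates:
        raise LookupError("no checkpoint can tolerate this failure")
    return max(candidates, key=lambda c: c.time)


# Hypothetical history: one expensive N-checkpoint, then two cheap 1-checkpoints.
history = [Checkpoint(0.0, "N"), Checkpoint(10.0, "1"), Checkpoint(20.0, "1")]
single = rollback_target(history, failed_processors=1)
multiple = rollback_target(history, failed_processors=3)
```

With this history, a single failure loses only the work done since the latest 1-checkpoint, whereas a multiple failure falls back to the older N-checkpoint. That asymmetry is the trade-off the abstract describes: the frequent case is cheap to recover from, and the rare case costs more.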