Cluster systems play a central role in realizing high-performance computing at relatively low cost, but they need fault-tolerance features for practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead of a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, obtaining a closed form for the expected total recovery overhead is not easy in general. Based on a simple failure model, we derive it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We then compare the expected total recovery overhead of the different checkpoint and rollback recovery schemes and quantitatively evaluate their effectiveness.