Comparing checkpoint and rollback recovery schemes in a cluster system

Authors:
Noriaki Bessho;Tadashi Dohi
Affiliations:
Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Higashi-Hiroshima, Japan;Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Higashi-Hiroshima, Japan
Venue:
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Year:
2012

Abstract

Cluster systems play a central role in realizing high-performance computing at relatively low cost, but they require fault-tolerance features for practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead of a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, it is not easy in general to obtain a closed form for the expected total recovery overhead. Based on a simple failure model, we derive it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We then compare the expected total recovery overheads of the different checkpoint and rollback recovery schemes and quantitatively evaluate their effectiveness.
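
The enumeration idea mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's model: it assumes each node fails independently with the same probability `p_fail`, and the function names, the pairwise mirroring layout, and the cost values `local_cost` and `restart_cost` are all hypothetical, chosen only to show how an expectation is built up by summing over every failure pattern.

```python
from itertools import product


def expected_overhead(n_nodes, p_fail, overhead):
    """Sum P(pattern) * overhead(pattern) over all 2**n_nodes failure patterns.

    Assumption (not from the paper): nodes fail independently with
    the same probability p_fail.
    """
    total = 0.0
    for pattern in product((False, True), repeat=n_nodes):
        failed = sum(pattern)  # number of failed nodes in this pattern
        prob = p_fail ** failed * (1.0 - p_fail) ** (n_nodes - failed)
        total += prob * overhead(pattern)
    return total


def mirroring_overhead(pattern, local_cost=1.0, restart_cost=10.0):
    """Hypothetical cost for pairwise mirroring.

    Nodes (0, 1), (2, 3), ... are assumed to hold each other's
    checkpoints. If both members of a pair fail, the checkpoint is
    lost and a full restart is charged; any other failure is charged
    a cheaper rollback from the surviving mirror.
    """
    if not any(pattern):
        return 0.0
    for i in range(0, len(pattern) - 1, 2):
        if pattern[i] and pattern[i + 1]:
            return restart_cost
    return local_cost


if __name__ == "__main__":
    print(expected_overhead(4, 0.05, mirroring_overhead))
```

Exhaustive enumeration grows as 2^n, so a sketch like this is only practical for a handful of nodes; it is meant to show the bookkeeping, not to stand in for a closed-form result.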