A Faster Checkpointing and Recovery Algorithm with a Hierarchical Storage Approach

Authors: Wen GAO; Mingyu CHEN; Takashi NANYA
Affiliations: Chinese Academy of Sciences; Chinese Academy of Sciences; University of Tokyo
Venue: HPCASIA '05: Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Year: 2005

Abstract

Fault tolerance is an indispensable part of a cluster operating system. The SCore cluster system provides coordinated checkpointing, a rollback-recovery mechanism, and a watchdog-timer failure detector for fault tolerance. In SCore's checkpointing algorithm, disk writes are the bottleneck. To eliminate the disk-write overhead, this paper proposes a new diskless checkpointing and rollback-recovery algorithm. Because the proposed algorithm neither calculates parity nor writes checkpoint data to disk, analysis shows that it checkpoints faster than the original algorithm. A comparison shows that its recovery time is also shorter. However, with this diskless checkpointing algorithm alone, the cluster cannot tolerate multiple transient failures. To compensate for this drawback, a hierarchical storage strategy is adopted. Experimental results show that the diskless algorithm with the hierarchical storage approach is fast and effective.
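
The abstract does not spell out the mechanism, so the following is only a rough illustration of the general idea, not the paper's algorithm. It assumes one common parity-free diskless scheme: each node keeps its checkpoint in memory and mirrors it into a neighbor's memory, while an occasional flush to disk serves as a slower second storage tier. The names Node and Cluster, the ring-neighbor pairing, and the disk_every interval are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    rank: int
    state: int = 0                     # stand-in for application state
    local_ckpt: Optional[int] = None   # this node's own in-memory checkpoint
    mirror_ckpt: Optional[int] = None  # copy of the previous node's checkpoint
    alive: bool = True


class Cluster:
    """Toy two-tier checkpoint store (assumed design, not from the paper)."""

    def __init__(self, n: int, disk_every: int = 3) -> None:
        if n < 2:
            raise ValueError("mirroring needs at least two nodes")
        self.nodes: List[Node] = [Node(r) for r in range(n)]
        self.disk: Dict[int, int] = {}  # slower, stable tier
        self.disk_every = disk_every    # flush to disk every k-th checkpoint
        self.rounds = 0

    def _mirror(self) -> None:
        # Memory tier: save locally and copy to the next node in the ring.
        n = len(self.nodes)
        for node in self.nodes:
            node.local_ckpt = node.state
            self.nodes[(node.rank + 1) % n].mirror_ckpt = node.state

    def checkpoint(self) -> None:
        # Coordinated checkpoint: all nodes save at the same logical point.
        self._mirror()
        self.rounds += 1
        if self.rounds % self.disk_every == 0:
            self.disk = {node.rank: node.state for node in self.nodes}

    def fail(self, rank: int) -> None:
        # A failed node loses everything it held in memory.
        node = self.nodes[rank]
        node.alive = False
        node.state = 0
        node.local_ckpt = None
        node.mirror_ckpt = None

    def recover(self) -> str:
        """Roll every node back; return which tier was used."""
        n = len(self.nodes)
        failed = [x.rank for x in self.nodes if not x.alive]
        # The memory tier suffices only if each failed node's mirror survived.
        if all(self.nodes[(r + 1) % n].alive for r in failed):
            for node in self.nodes:
                if node.alive:
                    node.state = node.local_ckpt
                else:
                    node.state = self.nodes[(node.rank + 1) % n].mirror_ckpt
            tier = "memory"
        else:
            if not self.disk:
                raise RuntimeError("no disk checkpoint available")
            # All nodes roll back to the older disk checkpoint for consistency.
            for node in self.nodes:
                node.state = self.disk[node.rank]
            tier = "disk"
        for node in self.nodes:
            node.alive = True
        self._mirror()  # re-establish the in-memory copies
        return tier
```

In this toy model a single failure is repaired from a neighbor's memory, whereas the simultaneous failure of a node and its mirror holder forces a rollback to the older disk copy, which is the kind of gap a second storage tier is meant to cover.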