Performance evaluation of the striped checkpointing algorithm on the distributed RAID for cluster computer

Authors:
Yun Seok Chang;Sun Young Cho;Bo Yeon Kim
Affiliations:
Department of Computer Engineering, Daejin University, Pocheon, Korea; Basic Science Research Institute, Chungbuk University, Chungju, Korea; Department of Electrical and Computer Engineering, Kangwon National University, Chuncheon, Korea
Venue:
ICCS'03 Proceedings of the 2003 international conference on Computational science: Part II
Year:
2003

Abstract

In a serverless cluster computer, a distributed RAID is used to save checkpoint files periodically, according to a checkpointing algorithm, for rollback recovery. The striped checkpointing algorithm proposed in this paper combines the merits of the coordinated and the staggered checkpointing algorithms. Coordination enables parallel I/O on the distributed disks, and staggering avoids the network bottleneck in distributed disk I/O operations. With a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. The striped checkpointing approach allows dynamic reconfiguration to minimize checkpointing overhead among concurrent software processes. We demonstrate how to reduce the overhead by striping and staggering dynamically. For communication-intensive computational programs, the new scheme can significantly reduce the checkpointing overhead. Linpack HPC Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing algorithms for fast rollback recovery from any single node failure in a cluster computer.
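
The tradeoff between stripe parallelism and staggering can be illustrated with a small scheduling sketch. It is not the authors' implementation: it assumes that, for a fixed cluster size, nodes are split into groups of a chosen stripe width, that the nodes in a group write their checkpoints in parallel, and that the groups take turns in successive time slots. The function name `striped_schedule` and its parameters are hypothetical.

```python
from typing import List


def striped_schedule(num_nodes: int, stripe_width: int) -> List[List[int]]:
    """Split node IDs 0..num_nodes-1 into stagger groups.

    Assumption (illustrative, not taken from the paper): each group holds at
    most `stripe_width` nodes that write their checkpoints in parallel, and
    the groups write one after another in successive time slots.
    """
    if num_nodes <= 0 or stripe_width <= 0:
        raise ValueError("num_nodes and stripe_width must be positive")
    return [
        list(range(start, min(start + stripe_width, num_nodes)))
        for start in range(0, num_nodes, stripe_width)
    ]


if __name__ == "__main__":
    # Fixed cluster size of 8 nodes, varying the stripe width.
    for width in (8, 4, 1):
        groups = striped_schedule(8, width)
        print(f"stripe width {width}: {len(groups)} time slot(s)")
        for slot, group in enumerate(groups):
            print(f"  slot {slot}: nodes {group} write concurrently")
```

Under these assumptions, a stripe width equal to the cluster size corresponds to fully coordinated checkpointing (one slot, maximum parallel I/O), and a stripe width of 1 corresponds to fully staggered checkpointing (one node per slot, minimum contention). Intermediate widths trade one against the other.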