Performance evaluation of the striped checkpointing algorithm on the distributed RAID for cluster computer

  • Authors:
  • Yun Seok Chang;Sun Young Cho;Bo Yeon Kim

  • Affiliations:
  • Department of Computer Engineering, Daejin University, Pocheon, Korea;Basic Science Research Institute, Chungbuk University, Chungju, Korea;Department of Electrical and Computer Engineering, Kangwon National University, Chuncheon, Korea

  • Venue:
  • ICCS'03 Proceedings of the 2003 international conference on Computational science: PartII
  • Year:
  • 2003

Abstract

A distributed RAID in a serverless cluster computer is used to save checkpoint files periodically, as dictated by the checkpointing algorithm, for rollback recovery. The striped checkpointing algorithm proposed in this paper combines the merits of the coordinated and staggered checkpointing algorithms. Coordination enables parallel I/O on the distributed disks, while staggering avoids the network bottleneck in distributed disk I/O operations. For a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. The striped checkpointing approach allows dynamic reconfiguration to minimize checkpointing overhead among concurrent software processes. We demonstrate how to reduce this overhead by striping and staggering dynamically. For communication-intensive computational programs, the new scheme can significantly reduce the checkpointing overhead. Linpack HPC Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing algorithms for fast rollback recovery from any single-node failure in a cluster computer.
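
The tradeoff between striping and staggering described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a simplified model in which the nodes are split into groups of at most `stripe_width` nodes, the nodes of one group write their checkpoints in parallel, and the groups take turns one after another. The names `striped_staggered_schedule`, `num_nodes`, and `stripe_width` are hypothetical.

```python
def striped_staggered_schedule(num_nodes, stripe_width):
    """Return a list of time slots; each slot lists the node ids that
    checkpoint in parallel during that slot.

    Assumed model (not taken from the paper): nodes 0..num_nodes-1 are
    grouped in id order, and groups are staggered one slot after another.
    """
    if num_nodes < 1 or not 1 <= stripe_width <= num_nodes:
        raise ValueError("need 1 <= stripe_width <= num_nodes")
    return [
        list(range(start, min(start + stripe_width, num_nodes)))
        for start in range(0, num_nodes, stripe_width)
    ]


if __name__ == "__main__":
    # stripe_width == 1 is purely staggered, stripe_width == num_nodes is
    # purely coordinated, and values in between mix the two techniques.
    for width in (1, 4, 8):
        slots = striped_staggered_schedule(8, width)
        print(f"stripe_width={width}: {len(slots)} slot(s) -> {slots}")
```

In this model, a wider stripe means fewer slots and more parallel disk I/O but more simultaneous network traffic, while a narrower stripe spreads the traffic over more slots. Choosing the width per run is one way to picture the dynamic reconfiguration the abstract mentions.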