Stable Checkpointing in Distributed Systems without Shared Disks

  • Authors:
  • Peter Sobe

  • Affiliations:
  • -

  • Venue:
  • IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

Interacting processes in distributed systems save their checkpoints on local disks for efficiency reasons. But, because local checkpoints get unavailable with failing hosts, redundancy schemes similar to RAID-like storage schemes have to be used. In such systems,checkpoints are stable under a particular fault model because they can get reconstructed in the distributed system. In this paper, two variants of stable checkpoint storage will be compared, (i) parity grouping over local checkpoints and (ii) RAID-like distribution of each checkpoint using a software based distributed storage system. An analysis is given to compare costs for collective checkpoint creation, recovery of a single process and rollbackof all processes. The results show that despite of differences in detail, checkpointing using a distributed storage system is a reasonable solution.