Stable Checkpointing in Distributed Systems without Shared Disks

Authors:
Peter Sobe
Affiliations:
-
Venue:
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Year:
2003

Citing 0
Cited 6

Distributed Storage Layout Schemes
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17

Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems
MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing

Strategies for Checkpoint Storage on Opportunistic Grids
IEEE Distributed Systems Online

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Construction of efficient or-based deletion-tolerant coding schemes
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)

Abstract

Interacting processes in distributed systems save their checkpoints on local disks for efficiency. However, because local checkpoints become unavailable when their hosts fail, redundancy schemes similar to RAID storage have to be used. In such systems, checkpoints are stable under a particular fault model because they can be reconstructed within the distributed system. This paper compares two variants of stable checkpoint storage: (i) parity grouping over local checkpoints and (ii) RAID-like distribution of each checkpoint using a software-based distributed storage system. An analysis compares the costs of collective checkpoint creation, recovery of a single process, and rollback of all processes. The results show that, despite differences in detail, checkpointing using a distributed storage system is a reasonable solution.
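
To illustrate variant (i), the following is a minimal sketch of parity grouping over local checkpoints. It is not the paper's implementation. It assumes XOR parity as in RAID, a group that tolerates the loss of one member, and zero-padding of checkpoints to a common length. The function names `xor_blocks`, `pad`, `parity_of`, and `reconstruct` are hypothetical.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def pad(blocks: list[bytes]) -> list[bytes]:
    """Zero-pad every block to the length of the longest one (assumption)."""
    size = max(len(b) for b in blocks)
    return [b.ljust(size, b"\x00") for b in blocks]


def parity_of(checkpoints: list[bytes]) -> bytes:
    """Parity block of one group: XOR over all padded local checkpoints."""
    return reduce(xor_blocks, pad(checkpoints))


def reconstruct(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint (in padded form) by XOR-ing
    the surviving checkpoints of the group with the parity block."""
    return reduce(xor_blocks, pad(survivors + [parity]))


# Usage: three processes hold local checkpoints; the host of the second fails.
group = [b"state-A", b"state-BB", b"state-CCC"]
parity = parity_of(group)          # stored on a host outside the group
rebuilt = reconstruct([group[0], group[2]], parity)
assert rebuilt.rstrip(b"\x00") == group[1]
```

In this sketch the rebuilt checkpoint comes back zero-padded, so a real system would also have to record each checkpoint's original length. A single parity block covers only one failed host per group.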