Cloning-Based Checkpoint for Localized Recovery

Authors:
Zunce Wei;Hon F. Li;Dhrubajyoti Goswami
Affiliations:
Concordia University;Concordia University;Concordia University
Venue:
ISPAN '05 Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks
Year:
2005

Abstract

This paper studies the use of process clones to localize recovery in large-scale distributed systems. A clone is a virtual recovery process with a limited lifetime, and is useful for decoupling recovery dependencies among checkpoints. A generic Checkpoint Dependency Graph (CDG) model is used to capture the dependency relations among checkpoints. A Non-atomic Group Checkpoint (NGC) protocol is presented, and it is proved that, when clones are employed, the protocol can result in localized recovery involving a single group. To limit the spread of recovery, the size of a group should be bounded. The paper presents several results in this respect: (i) there is no embedded protocol for atomic group formation with a bounded group size (a k-bounded protocol); (ii) a k-bounded atomic group checkpoint protocol requires at least m-1 explicit messages for checkpoint synchronization in a system consisting of m processes. Lastly, a simple k-bounded atomic group checkpoint protocol is presented and proved correct.
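
The abstract does not spell out how the CDG is defined, so the following is only an illustrative sketch under assumed semantics, not the paper's model or its NGC protocol. It assumes that nodes are checkpoints, that a directed edge means one checkpoint depends on another, and that a failure forces a rollback of every checkpoint reachable from the failed one. The class name, method names, and the (process, index) checkpoint labels are all hypothetical. Under these assumptions, the set of reachable checkpoints gives a rough picture of how far recovery spreads, which is the quantity that grouping and cloning aim to keep small.

```python
from collections import defaultdict, deque


class CheckpointDependencyGraph:
    """Toy dependency graph (assumed semantics, not the paper's definition).

    Nodes are checkpoint ids. An edge source -> dependent means that
    `dependent` must be rolled back whenever `source` is rolled back.
    """

    def __init__(self):
        self._dependents = defaultdict(set)

    def add_dependency(self, source, dependent):
        """Record that `dependent` depends on `source`."""
        self._dependents[source].add(dependent)

    def rollback_set(self, failed):
        """Return all checkpoints reachable from `failed`, itself included."""
        seen = {failed}
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for leaf checkpoints.
            for nxt in self._dependents.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


def demo():
    """Build a small graph with two independent dependency chains."""
    cdg = CheckpointDependencyGraph()
    # Checkpoints are (process, index) pairs; all names are illustrative.
    cdg.add_dependency(("p1", 1), ("p2", 1))
    cdg.add_dependency(("p2", 1), ("p3", 1))
    cdg.add_dependency(("p4", 1), ("p5", 1))
    return cdg.rollback_set(("p1", 1))
```

In this toy graph, a failure at checkpoint ("p1", 1) drags in the checkpoints of p2 and p3 but leaves the p4/p5 chain untouched. Removing an edge, which is loosely what decoupling a dependency would amount to in this simplified picture, shrinks the rollback set accordingly.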