Quasi-atomic recovery for distributed agents
Parallel Computing
This paper studies the use of process clones to localize recovery in large-scale distributed systems. A clone is a virtual recovery process with a limited lifetime that decouples recovery dependencies among checkpoints. A generic Checkpoint Dependency Graph (CDG) model is used to capture the dependency relations among checkpoints. A Non-atomic Group Checkpoint (NGC) protocol is presented, and it is proved that, when clones are employed, the protocol yields localized recovery involving a single group. To limit the spread of recovery, the size of a group should be bounded. The paper presents two results in this respect: (i) there is no embedded protocol for atomic group formation with a bounded group size (a k-bounded protocol); (ii) a k-bounded atomic group checkpoint protocol requires at least m-1 explicit messages for checkpoint synchronization in a system of m processes. Lastly, a simple k-bounded atomic group checkpoint protocol is presented and proved correct.
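As a rough illustration of why dependency relations among checkpoints determine how far recovery spreads, the sketch below models a checkpoint dependency graph as a plain directed graph and computes which checkpoints, and hence which processes, would be drawn into a rollback. The abstract does not give the paper's formal CDG definition, so the edge semantics used here (an edge from a to b means b must roll back if a does), the class name `CheckpointDependencyGraph`, and the helper `affected_processes` are all assumptions made for illustration. Clones and the NGC protocol are not modeled.

```python
from collections import deque


class CheckpointDependencyGraph:
    """Hypothetical CDG sketch: nodes are (process, index) checkpoints.

    Assumed semantics (not taken from the paper): an edge a -> b means
    that rolling back checkpoint a forces checkpoint b to roll back too.
    """

    def __init__(self):
        self.dependents = {}  # checkpoint -> set of dependent checkpoints

    def add_dependency(self, src, dst):
        """Record that dst depends on src."""
        self.dependents.setdefault(src, set()).add(dst)

    def rollback_set(self, failed):
        """Return every checkpoint reachable from `failed` (breadth-first)."""
        seen = {failed}
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            for nxt in self.dependents.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


def affected_processes(graph, failed):
    """Processes that own at least one checkpoint in the rollback set."""
    return {process for process, _ in graph.rollback_set(failed)}


# Usage: a dependency chain P1 -> P2 -> P3, plus an unrelated process P4.
cdg = CheckpointDependencyGraph()
cdg.add_dependency(("P1", 1), ("P2", 1))
cdg.add_dependency(("P2", 1), ("P3", 1))
cdg.add_dependency(("P4", 1), ("P4", 2))
print(sorted(affected_processes(cdg, ("P1", 1))))
```

Under these assumptions, recovery spread is simply graph reachability: cutting an edge (the role the abstract attributes to clones) shrinks the reachable set, and bounding group size bounds the number of processes that can appear in it.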