Cyclic Storage for Fault-Tolerant Distributed Executions

Authors:
Ricardo Marcelin-Jimenez;Sergio Rajsbaum;Brett Stevens
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2006



Abstract

Given a set V of active components in charge of a distributed execution, a storage scheme is a sequence B_0, B_1, ..., B_{b-1} of subsets of V, where successive global states are recorded. The subsets, also called blocks, have the same size and are scheduled according to some fixed and cyclic calendar of b steps. During the i-th step, block B_i is selected. Each component takes a copy of its local state and sends it to one of the components in B_i, in such a way that each component stores (approximately) the same number of local states. Afterward, if a component of B_i crashes, all of its stored data is lost and the computation cannot continue. If there exists a block with no failed components in it, then a recent global state can be retrieved and the computation does not need to start over from the very beginning. The goal is to design storage schemes that tolerate as many crashes as possible, while trying to have each component participating in as few blocks as possible and, at the same time, working with large blocks (so that a component in a block stores a small number of local states). In this paper, several such schemes are described and compared in terms of these measures.
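
The definitions in the abstract can be illustrated with a small sketch. It is not one of the schemes from the paper: the "consecutive window" block layout, the round-robin assignment of local states, and all function names (`cyclic_blocks`, `assign_states`, `most_recent_intact`) are assumptions made only for illustration. Components are numbered 0..n-1, every block has the same size k, and a block is usable only if none of its members has crashed.

```python
from typing import Dict, List, Optional, Sequence, Set


def cyclic_blocks(n: int, k: int) -> List[List[int]]:
    """Hypothetical layout: block i holds k consecutive components starting at i (mod n).

    This yields b = n blocks of equal size k; each component lies in exactly k blocks.
    """
    return [[(i + j) % n for j in range(k)] for i in range(n)]


def assign_states(n: int, block: Sequence[int]) -> Dict[int, List[int]]:
    """Send each component's local state to one block member, round-robin.

    The number of local states held by any two members differs by at most one.
    """
    holders: Dict[int, List[int]] = {member: [] for member in block}
    for component in range(n):
        holders[block[component % len(block)]].append(component)
    return holders


def most_recent_intact(
    blocks: Sequence[Sequence[int]], crashed: Set[int], step: int
) -> Optional[int]:
    """Walk the cyclic calendar backwards from `step` and return the index of the
    most recently used block with no crashed member, or None if every block is hit."""
    b = len(blocks)
    for age in range(b):
        i = (step - age) % b
        if not crashed & set(blocks[i]):
            return i
    return None


if __name__ == "__main__":
    blocks = cyclic_blocks(n=6, k=2)
    print(blocks)
    print(assign_states(6, blocks[0]))
    print(most_recent_intact(blocks, crashed={0, 3}, step=5))
    print(most_recent_intact(blocks, crashed={0, 2, 4}, step=5))
```

In this toy layout, larger k lowers the number of local states each block member must hold (about n/k) but raises the number of blocks each component participates in (k), which is the trade-off the abstract names among its design measures.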