Resilient distributed computing

Authors:
Liba Svobodova
Affiliations:
-
Venue:
IEEE Transactions on Software Engineering
Year:
1984

Abstract

A control abstraction called an atomic action is a powerful general mechanism for ensuring consistent behavior of a system in spite of failures of individual computations running in the system, and in spite of system crashes. However, because of the "all-or-nothing" property of atomic actions, a significant amount of work might be abandoned needlessly when an internal error is encountered. This paper discusses how the implementation of resilient distributed systems can be supported using a combination of nested atomic actions and stable checkpoints. Nested atomic actions form a tree structure. When an internal atomic action terminates, its results are not made permanent until the outermost atomic action commits, but they survive local node failures. Each subtree of atomic actions is recoverable individually. A checkpoint is established in stable storage as part of a remote request so that results of such a request can be reclaimed if the requesting node fails in the meantime. The paper shows how remote procedure call primitives with "at-most-once" semantics and recovery blocks can be built with these mechanisms.
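
The commit rule for nested atomic actions described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `AtomicAction` class, its method names, and the in-memory `store` dictionary standing in for permanent state are all assumptions made for illustration. The sketch does not model stable storage, node failures, checkpoints, or remote requests. It only shows that a nested action's results stay tentative until the outermost action commits, and that one subtree can be aborted without discarding the work of its siblings.

```python
class AtomicAction:
    """Hypothetical nested atomic action that buffers tentative writes."""

    def __init__(self, store, parent=None):
        self.store = store      # dict standing in for permanent state (assumption)
        self.parent = parent    # None for the outermost action
        self.writes = {}        # tentative results of this action

    def begin_nested(self):
        # Create a child action, forming one more level of the action tree.
        return AtomicAction(self.store, parent=self)

    def read(self, key):
        # Look through this action and its ancestors before the permanent state.
        action = self
        while action is not None:
            if key in action.writes:
                return action.writes[key]
            action = action.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.parent is not None:
            # Nested commit: results pass to the parent but remain tentative.
            self.parent.writes.update(self.writes)
        else:
            # Outermost commit: results become permanent.
            self.store.update(self.writes)
        self.writes = {}

    def abort(self):
        # Discard only this subtree's tentative results.
        self.writes = {}


def run_demo():
    store = {"balance": 100}
    top = AtomicAction(store)

    child = top.begin_nested()
    child.write("balance", child.read("balance") + 50)
    child.commit()                      # tentative in `top`, not yet in `store`
    before_outer_commit = store["balance"]

    failing = top.begin_nested()
    failing.write("balance", -1)
    failing.abort()                     # sibling's work in `top` is unaffected

    top.commit()                        # outermost commit makes results permanent
    return before_outer_commit, store["balance"]
```

Calling `run_demo()` returns the permanent balance before and after the outermost commit. The first value is unchanged by the nested commit, and the second reflects only the work of the subtree that committed.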