Reconfiguration and transient recovery in state machine architectures

Authors:
J. Rushby
Venue:
FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Year:
1996

References (9):

Synchronizing clocks in the presence of faults
Journal of the ACM (JACM)

The MAFT Architecture for Distributed Fault Tolerance
IEEE Transactions on Computers - Fault-Tolerant Computing

Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys (CSUR)

Replica determinism in distributed real-time systems: a brief survey
Real-Time Systems

A formally verified algorithm for clock synchronization under a hybrid fault model
PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing

Reaching Agreement in the Presence of Faults
Journal of the ACM (JACM)

The Byzantine Generals Problem
ACM Transactions on Programming Languages and Systems (TOPLAS)

Distributed Fault-Tolerant Real-Time Systems: The Mars Approach
IEEE Micro

Fault-tolerant clock synchronization
PODC '84 Proceedings of the third annual ACM symposium on Principles of distributed computing

Cited by (2):

Formally Verified On-Line Diagnosis
IEEE Transactions on Software Engineering

Replication Management in Reliable Real-Time Systems
Real-Time Systems

Abstract

We consider an architecture for ultra-dependable operation based on synchronized state machine replication, extended to provide transient recovery and reconfiguration in the presence of arbitrary faults. The architecture allows processors suspected of being faulty to be placed on "probation." Processors in this status cannot disrupt other processors, but those that are nonfaulty or recovering from transient faults are able to remain synchronized with the other processors and with each other, can participate in interactively consistent exchange of data (i.e., Byzantine agreement), and can restore damaged state data by loading majority-voted copies from other processors. The processors that are not on probation are able to coordinate membership of their group and to take processors on and off probation. These properties are achieved even if all the processors on probation and some of the others exhibit Byzantine faults, provided a majority of all processors are nonfaulty. Key elements of the architecture are modified treatments for the problems of interactive consistency, clock synchronization, and group membership. Classical algorithms for these problems that tolerate t Byzantine faults among n processors are extended to tolerate t+p faults among n+p processors, partitioned into n "core members" and p "probationers," provided no more than t faults occur among the core members.
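
The state-restoration step mentioned in the abstract (loading majority-voted copies from other processors) and the proviso on core-member faults can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names `majority_vote` and `core_within_budget`, the use of a strict majority over whatever copies are received, and the sample values are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(copies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of the copies, else None.

    Assumption: a recovering processor votes over all copies it received;
    the paper may restrict or structure the vote differently.
    """
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    return value if 2 * count > len(copies) else None


def core_within_budget(core_faulty: int, t: int) -> bool:
    """Proviso from the abstract: no more than t faults among core members."""
    return core_faulty <= t


# Hypothetical scenario: a recovering processor receives five copies of one
# state word; two senders are faulty and report arbitrary values.
received = [42, 42, 17, 42, 99]
restored = majority_vote(received)
```

In this hypothetical run three of the five copies agree, so the vote yields the agreed value; with no strict majority the sketch returns `None` rather than guessing, which is one possible design choice among several.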