Fair distribution of concerns in design and evaluation of fault-tolerant distributed computer systems

Authors:
K. H. (Kane) Kim
Affiliations:
Department of Electrical and Computer Engineering, University of California, Irvine, CA 92717, USA
Venue:
Computer Communications
Year:
1994

References:
Clock synchronization in distributed real-time systems. IEEE Transactions on Computers - Special Issue on Real-Time Systems.
Reaching Agreement in the Presence of Faults. Journal of the ACM (JACM).
The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems (TOPLAS).
System structure for software fault tolerance. IEEE Transactions on Software Engineering.

Abstract

In analysing the fault tolerance capabilities of distributed computer system designs, the functionally or physically replaceable components have usually been modelled as one of two extreme types with respect to their failure symptoms (faulty output behaviour): the fail-silent unit (FSU) model at the simplest end, and the malicious unit (MaU) model, also called the Byzantine unit model, at the other end. The basic weaknesses of these models for use in practical system design and evaluation are pointed out. The FSU model is justifiable for handling most simple parts of a system, but not for the more complex parts. The MaU model is misleading in that it tends to draw attention to events of negligible occurrence probability while taking attention away from events of higher occurrence probability. It is also pointed out that the state of the art in analytic modelling and evaluation of fault-tolerant distributed computer systems has a vast, weakly characterized region in the domain of conceivable component models enclosed by the two extreme models. The main constructive proposition for advancing the state of the art is to establish scientific procedures for fair distribution of concerns over possible occurrences of anomalous events during system design and validation. A direction for obtaining such a fair modelling procedure, relying on extensive probabilistic reasoning, is proposed. The principle may have broad applicability, extending well beyond the area of fault-tolerant computing.