The design, analysis, and verification of the SIFT fault tolerant system

Authors:
John H. Wensley;Milton W. Green;Karl N. Levitt;Robert E. Shostak
Venue:
ICSE '76 Proceedings of the 2nd international conference on Software engineering
Year:
1976

Abstract

The SIFT (Software Implemented Fault Tolerance) computer is a fault-tolerant computer in which fault tolerance is achieved primarily by software mechanisms. Tasks are executed redundantly on multiple, independent processors that are loosely synchronized. Each processor is multiprogrammed over a set of distinct tasks. A system of independently accessible busses interconnects the processors. When Task A needs data from Task B, each version of A votes, in software, on the data computed by the different versions of B. (A processor cannot write into another processor; all communication is accomplished by reading.) Thus, errors due to a malfunctioning processor or bus can be confined to the faulty unit and masked, and the faulty unit can be identified. An executive routine locates the fault and reconfigures the system by reassigning the tasks previously assigned to the faulty unit to an operative unit. Since fault-tolerant computers are used in environments where reliability is at a premium, it is essential that the software of SIFT be correct. The software is realized as a hierarchy of modules in a way that significantly eases proof of correctness. The behavior of each module is characterized by a formal specification, and the implementation of the module is verified with respect to its specification and the specifications of modules at lower levels of the hierarchy. An abstract, Markov-like model is used to describe the reliability behavior of SIFT. This model is formally related to the specifications of the top-most modules of the hierarchy; thus the model can be shown to describe accurately the behavior of the system. At the time of writing, the verification of the system is not complete. The paper describes the design of SIFT, the reliability model, and the approach to mapping from the system to the model.
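
The voting and reconfiguration steps described in the abstract can be illustrated with a minimal sketch. This is not the SIFT implementation: the function names `vote` and `reassign`, the processor identifiers, and the strict-majority rule are assumptions made for illustration only. The sketch shows how a reader of replicated data might mask a disagreeing value, flag the processors that produced it, and hand their tasks to another unit.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple


def vote(readings: Dict[str, Hashable]) -> Tuple[Optional[Hashable], List[str]]:
    """Majority-vote over values read from the replicas of one task.

    `readings` maps a hypothetical processor id to the value computed by
    that processor's copy of the task. Returns the strict-majority value
    (None if there is none) and the ids of processors that disagreed.
    """
    if not readings:
        return None, []
    counts = Counter(readings.values())
    value, count = counts.most_common(1)[0]
    if count * 2 <= len(readings):
        # No strict majority: the error cannot be masked, and every
        # contributor remains a suspect.
        return None, sorted(readings)
    suspects = sorted(p for p, v in readings.items() if v != value)
    return value, suspects


def reassign(
    assignment: Dict[str, List[str]], faulty: str, operative: str
) -> Dict[str, List[str]]:
    """Move every task replica from the faulty unit to an operative unit.

    Assumes (for simplicity) that `operative` does not already run a
    replica of any affected task.
    """
    return {
        task: [operative if p == faulty else p for p in procs]
        for task, procs in assignment.items()
    }


# Three copies of Task B; the copy on p3 has produced a wrong value.
value, suspects = vote({"p1": 42, "p2": 42, "p3": 7})
plan = {"B": ["p1", "p2", "p3"]}
for unit in suspects:
    plan = reassign(plan, unit, "p4")
```

In this sketch the voter only reads the values it is given, mirroring the constraint that processors communicate by reading rather than by writing into one another. A real executive would need further evidence before declaring a unit faulty, since a single disagreement could also stem from a bus error.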