The design, analysis, and verification of the SIFT fault tolerant system

  • Authors:
  • John H. Wensley;Milton W. Green;Karl N. Levitt;Robert E. Shostak

  • Venue:
  • ICSE '76 Proceedings of the 2nd international conference on Software engineering
  • Year:
  • 1976

Abstract

The SIFT (Software Implemented Fault Tolerance) computer is a fault-tolerant computer in which fault tolerance is achieved primarily by software mechanisms. Tasks are executed redundantly on multiple, independent processors that are loosely synchronized. Each processor is multiprogrammed over a set of distinct tasks. A system of independently accessible busses interconnects the processors. When Task A needs data from Task B, each version of A votes, using software, on the data computed by the different versions of B. (A processor cannot write into another processor; all communication is accomplished by reading.) Thus, errors due to a malfunctioning processor or bus can be confined to the faulty unit and can be masked, and the faulty unit can be identified. An executive routine effects the fault location and reconfigures the system by assigning the tasks, previously assigned to the faulty unit, to an operative unit. Since fault-tolerant computers are used in environments where reliability is at a premium, it is essential that the software of SIFT be correct. The software is realized as a hierarchy of modules in a way that significantly enhances proof of correctness. The behavior of each module is characterized by a formal specification, and the implementation of the module is verified with respect to its specification and those of modules at lower levels of the hierarchy. An abstract, Markov-like model is used to describe the reliability behavior of SIFT. This model is formally related to the specifications of the top-most modules of the hierarchy; thus the model can be shown to describe accurately the behavior of the system. At the time of writing, the verification of the system is not complete. The paper describes the design of SIFT, the reliability model, and the approach to mapping from the system to the model.
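
The voting and reconfiguration steps described in the abstract can be illustrated with a minimal Python sketch. It is not the SIFT implementation. The function names `vote` and `reconfigure`, the processor identifiers, and the strict-majority rule are all assumptions made for illustration; the abstract only says that each version of a task votes in software on the values it reads, and that an executive reassigns the tasks of a faulty unit to an operative one.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple


def vote(readings: Dict[str, Hashable]) -> Tuple[Optional[Hashable], List[str]]:
    """Strict-majority vote over values read from the replicas of a task.

    `readings` maps a hypothetical processor id to the value produced by that
    processor's copy of the task. Returns the majority value (None if no
    strict majority exists) and the sorted ids of processors that disagreed.
    """
    if not readings:
        return None, []
    value, count = Counter(readings.values()).most_common(1)[0]
    if count * 2 <= len(readings):
        # No strict majority: the error cannot be masked by this vote.
        return None, []
    suspects = sorted(p for p, v in readings.items() if v != value)
    return value, suspects


def reconfigure(
    assignment: Dict[str, List[str]], faulty: str, spare: str
) -> Dict[str, List[str]]:
    """Return a new task assignment with the faulty processor removed.

    Tasks that ran on `faulty` are moved to `spare`, skipping any task the
    spare already runs (an assumed policy, so replicas are not duplicated).
    """
    new = {p: list(tasks) for p, tasks in assignment.items() if p != faulty}
    target = new.setdefault(spare, [])
    for task in assignment.get(faulty, []):
        if task not in target:
            target.append(task)
    return new


# Each copy of task A reads the output of every copy of task B and votes.
value, suspects = vote({"p1": 42, "p2": 42, "p3": 17})
```

In this example `value` is `42` and `suspects` is `["p3"]`: the faulty result is masked and the disagreeing processor is identified, after which a call such as `reconfigure(assignment, "p3", "p4")` would move its tasks to an operative unit. Because voters only read the values, the sketch is consistent with the constraint that no processor writes into another.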