Fault-Tolerant Computing: An Introduction and a Perspective

  • Authors:
  • C. R. Kime

  • Affiliations:
  • Department of Electrical and Computer Engineering, University of Wisconsin

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1975

Abstract

Fault-tolerant computing has been defined as "the ability to execute specified algorithms correctly regardless of hardware failures, total system flaws, or program fallacies" [1]. A system that falls short of the requirements of this definition can be labeled a partially fault-tolerant system [2]. The definition thus provides a standard against which to measure all systems having some degree of fault tolerance. In particular, one can classify systems according to 1) the amount of manual intervention required and 2) the class of faults covered in performing the three basic functions involved in fault tolerance: system validation, fault diagnosis, and fault masking or recovery. The word "fault" is used here to cover the "failures, flaws, and fallacies" of the original definition. The first function is involved in the design and production of the system hardware and software, while the last two functions are embodied in the system itself. Likewise, the first function is directed at faults arising from design and production errors, whereas the last two functions are aimed at faults due to random hardware failures.
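The two-axis classification described in the abstract can be pictured as a small data structure. The sketch below is only an illustration and is not taken from the paper: the names `Function`, `FaultClass`, `Intervention`, `FunctionProfile`, and `uncovered_faults` are hypothetical, the three-level intervention scale is an assumption, and treating a system's coverage as the union of the faults covered by its three functions is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Function(Enum):
    # The three basic functions named in the abstract.
    SYSTEM_VALIDATION = "system validation"
    FAULT_DIAGNOSIS = "fault diagnosis"
    MASKING_OR_RECOVERY = "fault masking or recovery"


class FaultClass(Enum):
    # The three kinds of "fault" named in the quoted definition.
    HARDWARE_FAILURE = "hardware failure"
    SYSTEM_FLAW = "total system flaw"
    PROGRAM_FALLACY = "program fallacy"


class Intervention(Enum):
    # Hypothetical scale for the manual-intervention axis (an assumption).
    AUTOMATIC = 0
    ASSISTED = 1
    MANUAL = 2


@dataclass(frozen=True)
class FunctionProfile:
    intervention: Intervention  # how much manual work the function needs
    faults: frozenset           # FaultClass members the function covers


def uncovered_faults(profile):
    """Return the fault classes that no function in the profile covers."""
    covered = set()
    for function_profile in profile.values():
        covered |= function_profile.faults
    return set(FaultClass) - covered


# A made-up system: validation handles design-type faults manually,
# while diagnosis and recovery handle hardware failures in the system itself.
example = {
    Function.SYSTEM_VALIDATION: FunctionProfile(
        Intervention.MANUAL,
        frozenset({FaultClass.SYSTEM_FLAW, FaultClass.PROGRAM_FALLACY}),
    ),
    Function.FAULT_DIAGNOSIS: FunctionProfile(
        Intervention.ASSISTED, frozenset({FaultClass.HARDWARE_FAILURE})
    ),
    Function.MASKING_OR_RECOVERY: FunctionProfile(
        Intervention.AUTOMATIC, frozenset({FaultClass.HARDWARE_FAILURE})
    ),
}
```

Under these assumptions, a system whose profile leaves some fault class uncovered would count as partially fault-tolerant in the abstract's sense; the made-up `example` above leaves none uncovered.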