Toward Systematic Design of Fault-Tolerant Systems

  • Authors: Algirdas Avizienis

  • Venue: Computer
  • Year: 1997

Abstract

The mid-century "space race" was a major impetus for the development of fault-tolerant computing. Over the succeeding 25 years researchers expanded the concept of fault tolerance and refined the techniques for achieving it. Nevertheless, the bottom-up approach, entailing an infrastructure of autonomously fault-tolerant subsystems integrated with global fault tolerance functions, is less common today than the top-down approach, which relies on off-the-shelf (OTS) subsystems and a global monitoring function. A design paradigm for the systematic treatment of fault tolerance involves four steps: specification, implementation, evaluation, and modification. The paradigm offers a way to minimize the probability of oversights, mistakes, and inconsistencies that may occur during the implementation of fault tolerance. In spite of the long-range merits of this bottom-up approach, time and cost constraints often lead developers to use OTS subsystems when designing systems that are expected to be highly dependable. Even the Pentium Pro, which appears to have the most complete set of fault tolerance functions among contemporary microprocessors, has major drawbacks. Moreover, systems built from OTS subsystems are difficult to retrofit for fault tolerance. Without hardware support for fault tolerance, the only solution is to build a software monitor subsystem that tries to check all subsystems for indications of failure. But the monitor itself is unprotected because it resides and executes on an OTS processor. Researchers would do well to consider the human immune system as a model for systems in which fault tolerance is an integral attribute of every hardware element.