A Fault Tolerance Infrastructure for Dependable Computing with High-Performance COTS Components

  • Authors:
  • Algirdas Avizienis

  • Venue:
  • DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
  • Year:
  • 2000

Abstract

The failure rates of current COTS processors have dropped to 100 FITs (failures per 10^9 hours), indicating a potential MTTF of over 1100 years. However, our recent study of Intel P6 family processors has shown that they have very limited error detection and recovery capabilities and contain numerous design faults ("errata"). Other limitations are susceptibility to transient faults and uncertainty about "wearout" that could increase the failure rate over time. Because of these limitations, an external fault tolerance infrastructure is needed to assure the dependability of a system with such COTS components.

The paper describes a fault-tolerant "infrastructure" system of fault tolerance functions that makes possible the use of low-coverage COTS processors in a fault-tolerant, self-repairing system. The custom hardware supports transient recovery, design fault tolerance, and self-repair by sparing and replacement. The fault tolerance functions are implemented by four types of low-complexity hardware processors that are themselves fault-tolerant. High error detection coverage, including design faults, is attained by diversity and replication.
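
The MTTF figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a constant failure rate (so that MTTF is the reciprocal of the rate) and an average year of 365.25 days. The function name `fit_to_mttf_years` is a hypothetical helper.

```python
# Illustrative sketch (not from the paper): convert a FIT rate to MTTF.
# Assumptions: constant failure rate, so MTTF = 1 / rate;
# one year taken as 365.25 days.

HOURS_PER_YEAR = 24 * 365.25  # assumed average year length in hours


def fit_to_mttf_years(fit: float) -> float:
    """Return MTTF in years for a rate given in FITs (failures per 1e9 hours)."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    failures_per_hour = fit / 1e9          # FIT is defined per 10^9 device-hours
    mttf_hours = 1.0 / failures_per_hour   # reciprocal under constant-rate assumption
    return mttf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 100 FITs corresponds to 10^7 hours between failures on average.
    print(f"MTTF at 100 FITs: {fit_to_mttf_years(100):.0f} years")
```

Under these assumptions, 100 FITs gives an MTTF of 10^7 hours, a little over 1100 years, which is consistent with the abstract. The figure covers only the permanent hardware failure rate; it says nothing about transient faults, design faults, or wearout, which are the limitations the abstract goes on to raise.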