Algorithm-Based Fault Tolerance for Matrix Operations

  • Authors:
  • Kuang-Hua Huang; J. A. Abraham

  • Affiliations:
  • Engineering Research Center, AT&T Technologies, Inc.;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1984

Quantified Score

Hi-index 14.99

Abstract

The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.
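
The abstract does not spell out the encoding, so the following is only a minimal sketch of the idea of operating on encoded data. It assumes the checksum encoding commonly associated with this paper: a column-checksum matrix multiplied by a row-checksum matrix yields a full-checksum product, whose row and column sums can then locate and repair a single wrong data element. The function names are hypothetical, the example runs on one machine rather than on multiple processors, and it uses integers so that checksums can be compared exactly (floating-point data would need a tolerance).

```python
def column_checksum(a):
    """Append a row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in b]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def locate_error(c):
    """Return (i, j) of a single inconsistent data element, or None."""
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data element")


def correct(c, pos):
    """Rebuild the element at pos from its row checksum."""
    i, j = pos
    m = len(c[0]) - 1
    c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(column_checksum(a), row_checksum(b))  # full-checksum product

c[1][0] += 5             # simulate one faulty result element
pos = locate_error(c)    # intersection of the bad row and the bad column
correct(c, pos)
```

In this sketch an error in a checksum entry itself is reported as an exception rather than corrected; handling that case, and mapping elements to separate processors, is left out to keep the example short.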