Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor

  • Authors:
  • Prithviraj Banerjee; Joe T. Rahmeh; Craig Stunkel; V. S. Nair; Kaushik Roy; Vijay Balasubramanian; Jacob A. Abraham

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1990

Abstract

The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that isolate faulty processors and replace them with spare processors.
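
As an illustration of what algorithm-based error detection can look like for matrix multiplication, the sketch below uses the well-known checksum-matrix encoding. It is a minimal single-process example, not the authors' hypercube implementation: the function names, the tolerance value, and the relative-tolerance rule for absorbing finite-precision round-off are all assumptions made for this illustration.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum_row(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum_column(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) of a full-checksum product matrix.

    The last row and last column of c_full are checksums. A data row or
    column is flagged when its recomputed sum differs from the stored
    checksum by more than a relative tolerance (assumed here) that
    absorbs floating-point round-off.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns

    def differs(computed, stored):
        return abs(computed - stored) > tol * max(1.0, abs(stored))

    bad_rows = [i for i in range(n)
                if differs(sum(c_full[i][:m]), c_full[i][m])]
    bad_cols = [j for j in range(m)
                if differs(sum(c_full[i][j] for i in range(n)), c_full[n][j])]
    return bad_rows, bad_cols


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the encoded operands yields a product that carries its own checksums.
c_full = matmul(add_column_checksum_row(a), add_row_checksum_column(b))
clean = find_inconsistencies(c_full)

# Simulate a faulty processor corrupting one element of the result.
c_full[1][0] += 1.0
faulty = find_inconsistencies(c_full)
```

With a single corrupted element, the intersection of the flagged row and flagged column locates the error. In a multiprocessor setting that position would map back to the processor that computed it, although that mapping depends on how the data is partitioned and is not shown here.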