A novel approach to system-level fault tolerance in hypercube multiprocessors

  • Authors:
  • P. Banerjee; C. B. Stunkel

  • Affiliations:
  • Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign

  • Venue:
  • C3P Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
  • Year:
  • 1988

Quantified Score

Hi-index 0.01

Abstract

This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures deal mainly with reconfiguring the architecture once a faulty processor has been identified, and they assume the existence of an off-line diagnosis strategy that locates the faulty processor. We instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to build low-cost fault detection and location schemes from high-level encodings of the data that are tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, which is an extremely useful algorithm for solving a set of linear equations. We have performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings.
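
The abstract does not spell out the encoding it uses, so the following is only an illustrative sketch of the general idea of algorithm-based fault detection applied to Gaussian elimination. It assumes a simple row-checksum encoding: each row of the augmented matrix carries one extra element equal to the sum of its entries, and every row operation is applied to that element as well, so the relation should still hold after elimination. The function name, the `fault` parameter used to inject an error, and the tolerance `tol` are all hypothetical; the tolerance stands in for the finite-precision effects the abstract mentions. The sketch is sequential and does not model the distribution of rows across hypercube processors.

```python
def eliminate_with_checksums(aug, tol=1e-9, fault=None):
    """Forward elimination on an augmented matrix [A | b] with row checksums.

    aug   : list of n rows, each holding n coefficients and one right-hand side
    tol   : relative tolerance for the checksum test (an assumed value)
    fault : optional (step, row, col, delta) tuple, a hypothetical hook that
            corrupts one element after the given elimination step
    Returns (reduced rows without the checksum, indices of rows failing the test).
    """
    n = len(aug)
    # Encode: append to each row the sum of its entries.
    rows = [[float(x) for x in r] + [float(sum(r))] for r in aug]

    for k in range(n):
        # Partial pivoting; whole rows are swapped, checksum included.
        p = max(range(k, n), key=lambda i: abs(rows[i][k]))
        rows[k], rows[p] = rows[p], rows[k]
        if rows[k][k] == 0.0:
            raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            f = rows[i][k] / rows[k][k]
            # The checksum element undergoes the same row operation as the data.
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[k])]
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            rows[r][c] += delta  # simulated fault in a single data element

    suspects = []
    for i, row in enumerate(rows):
        data, check = row[:-1], row[-1]
        # Scale the tolerance so rounding error is not reported as a fault.
        scale = max(1.0, max(abs(x) for x in data))
        if abs(sum(data) - check) > tol * scale:
            suspects.append(i)
    return [row[:-1] for row in rows], suspects


A = [[2, 1, -1, 8], [-3, -1, 2, -11], [-2, 1, 2, -3]]
reduced, suspects = eliminate_with_checksums(A)
_, flagged = eliminate_with_checksums(A, fault=(0, 2, 1, 0.5))
print("fault-free suspects:", suspects)
print("rows flagged after injected fault:", flagged)
```

In this sketch a corrupted element can spread to every row later updated with the faulty row, so the check detects that something went wrong but does not by itself identify the faulty processor. Choosing `tol` is the delicate part: too small and rounding error raises false alarms, too large and small faults escape detection, which is the trade-off between fault coverage and finite-precision arithmetic that the abstract refers to.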