Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

  • Authors:
  • Ragini Narasimhan; Daniel J. Rosenkrantz; S. S. Ravi

  • Venue:
  • International Journal of Parallel Programming
  • Year:
  • 1999

Abstract

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, and in some cases dramatically fewer. We address both the analysis and the design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability of the system, that is, the maximum number of faulty processors that can be detected and located, respectively. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks that can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.
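
To make the analysis problem concrete, the sketch below computes fault detectability by brute force under a deliberately simplified toy model. It is not the algorithm or the formal model from the paper. The assumptions, all hypothetical, are: `produces` maps each processor to the data elements it can corrupt, each check is a set of data elements, a faulty set of processors may corrupt any nonempty subset of its data elements, and a check fires only when it covers between 1 and `h` erroneous elements (more errors are assumed able to mask one another). The enumeration is exponential in general and only illustrates the definition.

```python
from itertools import chain, combinations


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a tuple."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )


def detects(checks, erroneous, h=1):
    """Toy rule (assumption): a check fires iff it covers 1..h erroneous elements."""
    return any(1 <= len(check & erroneous) <= h for check in checks)


def is_t_fault_detecting(produces, checks, t, h=1):
    """True if every fault set of size 1..t is caught for every error pattern."""
    processors = list(produces)
    for size in range(1, t + 1):
        for faulty in combinations(processors, size):
            # Data elements that this fault set could corrupt.
            corruptible = set().union(*(produces[p] for p in faulty))
            for pattern in nonempty_subsets(corruptible):
                if not detects(checks, set(pattern), h):
                    return False
    return True


def detectability(produces, checks, h=1):
    """Largest t such that the check set is t-fault detecting in the toy model."""
    t = 0
    while t < len(produces) and is_t_fault_detecting(produces, checks, t + 1, h):
        t += 1
    return t


# Hypothetical three-processor example: each processor produces one element.
produces = {"p1": {"a"}, "p2": {"b"}, "p3": {"c"}}
pairwise_checks = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
single_check = [{"a", "b", "c"}]

print(detectability(produces, pairwise_checks))  # 2
print(detectability(produces, single_check))     # 1
```

In this toy instance, three pairwise checks tolerate two simultaneous faults, while one check over all three elements tolerates only one, because two errors inside the same check may mask each other under the assumed rule.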