Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

Authors:
Ragini Narasimhan;Daniel J. Rosenkrantz;S. S. Ravi
Venue:
International Journal of Parallel Programming
Year:
1999

Abstract

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, and in some cases dramatically fewer. We address both the analysis and the design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability of the system, that is, the maximum number of faulty processors that can be detected and located, respectively. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks that can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.
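
To illustrate why the analysis problem is tractable for a fixed number of faults, the following is a minimal sketch under a deliberately simplified toy model, not the model used in the paper. It assumes that each check covers a set of processors directly (data flow information is ignored) and that a check is guaranteed to fire only when exactly one of the processors it covers is faulty. The names `detects`, `is_t_fault_detecting`, and `detectability` are hypothetical. For fixed t, the brute-force test enumerates O(n^t) fault sets over n processors, which is polynomial in n; computing the detectability itself, with t unbounded, may enumerate exponentially many sets.

```python
from itertools import combinations


def detects(checks, faulty):
    """Toy assumption: a fault set is detected if some check covers
    exactly one faulty processor."""
    return any(len(check & faulty) == 1 for check in checks)


def is_t_fault_detecting(processors, checks, t):
    """Return True if every nonempty fault set of size <= t is detected.

    For fixed t this examines O(n^t) candidate fault sets.
    """
    processors = list(processors)
    for size in range(1, t + 1):
        for fault_set in combinations(processors, size):
            if not detects(checks, frozenset(fault_set)):
                return False
    return True


def detectability(processors, checks):
    """Largest t such that the system is t-fault detecting (toy model)."""
    processors = list(processors)
    t = 0
    while t < len(processors) and is_t_fault_detecting(processors, checks, t + 1):
        t += 1
    return t


if __name__ == "__main__":
    # Four processors, each adjacent pair on a ring shares one check.
    ring = [frozenset(p) for p in ({0, 1}, {1, 2}, {2, 3}, {3, 0})]
    print(detectability(range(4), ring))  # prints 3
```

In this example every fault set of up to three processors leaves some check with exactly one faulty processor, while the set of all four processors puts two faulty processors in every check, so the toy detectability is 3. A single check covering all four processors would yield only 1 under the same assumption.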