Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems

Authors:
Amber Roy-Chowdhury; Prithviraj Banerjee
Venue:
IEEE Transactions on Computers
Year:
1996

Abstract

Algorithm-based fault tolerance (ABFT) is an inexpensive method of incorporating fault tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results, which can then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly been concerned with error detection. A truly fault-tolerant algorithm, however, must also locate errors and recover from them. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data they corrupted. In this paper, we first present a general scheme for performing fault location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme applies system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data, which is then used to recover corrupted data in the event of processor failures. Results are presented for implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer; they demonstrate acceptably low overheads for single and double fault location and recovery.
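
To make the idea of "operating on encoded data and producing encoded results" concrete, the following is a minimal Python sketch of the classic row/column checksum encoding for matrix multiplication. It is not the location and recovery scheme of this paper, which relies on system-level diagnosis and coding theory across processors; it only illustrates how a checksum-encoded product can be checked, and how a single corrupted element can be located and corrected. The function names are hypothetical, and integer matrices are assumed so that checksums compare exactly (floating-point data would need a tolerance).

```python
# Illustrative sketch only: classic checksum-encoded matrix multiplication.
# This is NOT the paper's scheme; all names here are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c_full):
    """Check a full-checksum product and repair one corrupted data element.

    Returns None if every checksum is consistent, or (i, j) for the data
    element that was corrected in place. Raises ValueError for any error
    pattern this simple scheme cannot handle (for example, a corrupted
    checksum entry or more than one corrupted element).
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the element from its row checksum and the other entries.
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
        return (i, j)
    raise ValueError("error pattern not correctable by this simple scheme")
```

In use, one would multiply `add_column_checksum(a)` by `add_row_checksum(b)` with `matmul` and pass the result to `locate_and_correct`. The intersection of the one inconsistent row and the one inconsistent column identifies the faulty element, which is a small-scale analogue of locating a faulty processor and recovering its data from redundancy.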