Mantissa-Preserving Operations and Robust Algorithm-Based Fault Tolerance for Matrix Computations

Authors:
Shantanu Dutt;Fikri T. Assaad
Affiliations:
-;-
Venue:
IEEE Transactions on Computers
Year:
1996

Citing 15
Cited 3

The arithmetic of the digital computer: A new approach

SIAM Review
An analysis of algorithm-based fault tolerance techniques

Journal of Parallel and Distributed Computing
The algebraic eigenvalue problem

The algebraic eigenvalue problem
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Real-Number Codes for Fault-Tolerant Matrix Operations on Processor Arrays

IEEE Transactions on Computers
Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors

IEEE Transactions on Computers
Interval arithmetic tools and the precise scalar product in numerical analysis

Transactions of the Society for Computer Simulation International
Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor

IEEE Transactions on Computers
What every computer scientist should know about floating-point arithmetic

ACM Computing Surveys (CSUR)
How DEC developed Alpha

IEEE Spectrum
SPARC architecture, assembly language programming, and C

SPARC architecture, assembly language programming, and C
PowerPC 601 and Alpha 21064: A Tale of Two RISCs

Computer
Floating Point Fault Tolerance with Backward Error Assertions

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Computer Arithmetic in Theory and Practice

Computer Arithmetic in Theory and Practice
Rounding Errors in Algebraic Processes

Rounding Errors in Algebraic Processes

Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

International Journal of Parallel Programming
Randomized Algorithms: A System-Level, Poly-Time Analysis of Robust Computation

IEEE Transactions on Computers
Nonconcurrent error correction in the presence of roundoff noise

IEEE Transactions on Circuits and Systems Part I: Regular Papers

Quantified Score

Hi-index	15.00

Visualization

Abstract

A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without reducing the error coverage significantly is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can get magnified as a computation such as LU decomposition progresses. Here we develop a theory for applying integer mantissa checksum tests to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new "hybrid" testing scheme is substantially higher than the floating-point test with thresholding, and also that its time overhead with respect to the floating-point test is nominal (15% and 9.5% on the average for matrix multiplication and LU decomposition, respectively). The hybrid test can also be easily applied to other computations like matrix inversion that use the GE algorithm. We prove that the mantissa-based integer checksum test for both matrix multiplication and LU decomposition is able to detect at least three errors in the floating-point multiplication component of these computations. For LU decomposition, it is also able to correct a single error in the floating-point multiplies.