A Linear Algebraic Model of Algorithm-Based Fault Tolerance

Authors:
J. Anfinson; F. T. Luk
Affiliations:
Cornell Univ., Ithaca, NY; Cornell Univ., Ithaca, NY
Venue:
IEEE Transactions on Computers
Year:
1988

Citing 0
Cited 25

Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors

IEEE Transactions on Software Engineering
New Encoding/Decoding Methods for Designing Fault-Tolerant Matrix Operations

IEEE Transactions on Parallel and Distributed Systems
Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems

IEEE Transactions on Computers
Extending Backward Error Assertions to Tolerance of Large Errors in Floating Point Computations

IEEE Transactions on Computers
Evaluating Reliability Improvements of Fault Tolerant Array Processors Using Algorithm-Based Fault Tolerance

IEEE Transactions on Computers
Generalized Algorithm-Based Fault Tolerance: Error Correction via Kalman Estimation

IEEE Transactions on Computers
Safety-Critical Systems Built with COTS

Computer
Diagnosability and Diagnosis of Algorithm-Based Fault-Tolerant Systems

IEEE Transactions on Computers
Error Correcting Codes Over Z_(2^m) for Algorithm-Based Fault Tolerance

IEEE Transactions on Computers
Reliable Floating-Point Arithmetic Algorithms for Error-Coded Operands

IEEE Transactions on Computers
Computational Arrays with Flexible Redundancy

IEEE Transactions on Computers
Synthesis of Algorithm-Based Fault-Tolerant Systems from Dependence Graphs

IEEE Transactions on Parallel and Distributed Systems
Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems

IEEE Transactions on Parallel and Distributed Systems
Design of Algorithm-Based Fault-Tolerant Multiprocessor Systems for Concurrent Error Detection and Fault Diagnosis

IEEE Transactions on Parallel and Distributed Systems
Soft error vulnerability of iterative linear algebra methods

Proceedings of the 22nd annual international conference on Supercomputing
Optimal real number codes for fault tolerant matrix operations

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Nonconcurrent error correction in the presence of roundoff noise

IEEE Transactions on Circuits and Systems Part I: Regular Papers
Constructing numerically stable real number codes using evolutionary computation

Proceedings of the 12th annual conference on Genetic and evolutionary computation
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
Scalable distributed consensus to support MPI fault tolerance

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Proceedings of the 26th ACM international conference on Supercomputing
Fault resilience of the algebraic multi-grid solver

Proceedings of the 26th ACM international conference on Supercomputing
Correcting soft errors online in LU factorization

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Self-stabilizing iterative solvers

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems

Quantified Score

Hi-index	15.01

Abstract

A linear algebraic interpretation is developed for previously proposed algorithm-based fault tolerance schemes. The concepts of distance, code space, and the definitions of detection and correction in the vector space R^n are explained. The number of errors that can be detected or corrected for a distance-(d+1) code is derived. It is shown why the correction scheme does not work for general weight vectors, and a novel fast-correction algorithm for a weighted distance-5 code is derived.
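
As a rough illustration of the ideas named in the abstract (not the paper's own algorithm), the sketch below builds a small weighted checksum code and uses it to locate and correct a single error. It assumes integer data so that roundoff plays no role, and it assumes the weight vector 1, 2, ..., n; the function names `encode` and `check_and_correct` are hypothetical. With these two checksums every nonzero codeword has at least three nonzero entries, so the code has distance 3. Under the usual coding-theory convention, a distance-(d+1) code detects up to d errors and corrects up to floor(d/2), which for d = 2 means detecting two errors and correcting one.

```python
def encode(x):
    """Append two checksums to x: a plain sum and a weighted sum.

    The weights 1..n are an assumption made for this sketch.
    """
    s1 = sum(x)
    s2 = sum((i + 1) * v for i, v in enumerate(x))
    return list(x) + [s1, s2]


def check_and_correct(y):
    """Check an encoded vector and repair one corrupted data element.

    Returns (vector, position), where position is None if no error was
    found. Errors in the checksums themselves, or multiple errors, are
    reported as uncorrectable in this simplified sketch.
    """
    *data, s1, s2 = y
    # Syndromes: both are zero exactly when y lies in the code space.
    d1 = sum(data) - s1
    d2 = sum((i + 1) * v for i, v in enumerate(data)) - s2
    if d1 == 0 and d2 == 0:
        return list(y), None
    # A single error of size e at data index k gives d1 = e, d2 = (k+1)*e,
    # so the ratio d2/d1 reveals the position.
    if d1 != 0 and d2 % d1 == 0 and 1 <= d2 // d1 <= len(data):
        pos = d2 // d1 - 1
        fixed = list(data)
        fixed[pos] -= d1
        return fixed + [s1, s2], pos
    raise ValueError("error pattern not correctable by this sketch")


codeword = encode([3, 1, 4, 1, 5])
corrupted = list(codeword)
corrupted[2] += 7  # inject a single error into one data element
repaired, position = check_and_correct(corrupted)
```

The ratio of the two syndromes locates the error only because the weights are distinct. Correcting two errors would require a distance-5 code with more checksums, which is the case the abstract's fast-correction algorithm addresses.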