Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems

Authors:
Prithviraj Banerjee;Jacob A. Abraham
Affiliations:
Univ. of Illinois, Urbana, IL;Univ. of Illinois, Urbana, IL
Venue:
IEEE Transactions on Computers
Year:
1986

Abstract

An important consideration in the design of high-performance multiple processor systems is ensuring the correctness of results computed by such complex systems, which are extremely prone to transient and intermittent failures. The detection and location of faults and errors concurrently with normal system operation can be achieved through the application of appropriate on-line checks on the results of the computations. This is the domain of algorithm-based fault tolerance, which deals with low-cost system-level fault-tolerance techniques to produce reliable computations in multiple processor systems by tailoring the fault-tolerance techniques toward specific algorithms. This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed for achieving concurrent fault detection and location. The objective is to estimate the overhead in time and the number of processors required for such a scheme. Faults in processors, errors in the data, and checks on the data to detect and locate errors are represented as a tripartite graph. Bounds on the time and processor overhead are obtained by considering a series of subproblems. First, using some crude concepts for t-fault detection and t-fault location, bounds on the maximum size of the error patterns that can arise from such fault patterns are obtained. Using these results, bounds are derived on the number of checks required for error detection and location. Some numerical results are derived from a linear programming formulation.
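The tripartite structure described in the abstract (processors, data, checks) can be illustrated with a small brute-force sketch. It is not the paper's method: the names `proc_data`, `check_data`, `check_fires`, and `is_t_fault_detectable` are hypothetical, and the check model is an assumption made for illustration, namely that a check fires only when it covers between 1 and `h` erroneous data elements and that a faulty processor may corrupt any nonempty subset of the data it produces.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a set."""
    items = sorted(items)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)


def check_fires(covered, errors, h):
    """Assumed check model: fires when it covers 1..h erroneous elements."""
    hits = len(covered & errors)
    return 1 <= hits <= h


def is_t_fault_detectable(proc_data, check_data, t, h):
    """Brute-force test: does every error pattern caused by at most t
    faulty processors make at least one check fire?

    proc_data  maps processor -> set of data elements it can corrupt.
    check_data maps check     -> set of data elements it covers.
    """
    processors = sorted(proc_data)
    for k in range(1, t + 1):
        for faulty in combinations(processors, k):
            # Data elements that this fault pattern could corrupt.
            reachable = set().union(*(proc_data[p] for p in faulty))
            for errors in nonempty_subsets(reachable):
                if not any(check_fires(c, errors, h)
                           for c in check_data.values()):
                    return False  # an undetected error pattern exists
    return True


# Hypothetical example: two processors, four data elements, two checks.
proc_data = {"p1": {"d1", "d2"}, "p2": {"d3", "d4"}}
check_data = {"c1": {"d1", "d3"}, "c2": {"d2", "d4"}}

print(is_t_fault_detectable(proc_data, check_data, t=1, h=1))
print(is_t_fault_detectable(proc_data, check_data, t=2, h=1))
```

Under these assumptions the example tolerates a single faulty processor, but with two faulty processors the error pattern {d1, d3} overloads check c1 and misses c2, so it goes undetected. The enumeration is exponential in the number of data elements, which is one reason bounds of the kind derived in the paper are useful in place of exhaustive search.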