An important consideration in the design of high-performance multiple processor systems is ensuring the correctness of the results they compute, since such complex systems are extremely prone to transient and intermittent failures. Faults and errors can be detected and located concurrently with normal system operation by applying appropriate on-line checks to the results of the computations. This is the domain of algorithm-based fault tolerance, which provides low-cost, system-level fault tolerance for multiple processor systems by tailoring the techniques to specific algorithms. This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed to achieve concurrent fault detection and location. The objective is to estimate the overhead in time and in the number of processors required for such a scheme. Faults in processors, errors in the data, and checks on the data to detect and locate errors are represented as a tripartite graph. Bounds on the time and processor overhead are obtained by considering a series of subproblems. First, using some crude concepts of t-fault detection and t-fault location, bounds are obtained on the maximum size of the error patterns that can arise from such fault patterns. Using these results, bounds are then derived on the number of checks required for error detection and location. Some numerical results are derived from a linear programming formulation.
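The tripartite representation described above can be illustrated with a small brute-force sketch. It is not the paper's method and it computes no bounds. It only encodes processors, data elements, and checks as two edge maps, then tests a simplified notion of t-fault detection. Everything named here is an assumption made for illustration: the function `is_t_fault_detecting`, the maps `proc_to_data` and `check_to_data`, the parameter `h`, and the rule that a check fires reliably only when it covers between 1 and `h` erroneous data elements. Check-evaluating hardware is assumed fault-free.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a tuple."""
    items = sorted(items)
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


def is_t_fault_detecting(proc_to_data, check_to_data, t, h=1):
    """Brute-force test of a simplified t-fault detection property.

    proc_to_data:  processor -> set of data elements it may corrupt
    check_to_data: check -> set of data elements it covers
    Assumption (hypothetical, for illustration): a check is guaranteed
    to fire only if it covers between 1 and h erroneous elements;
    more than h errors may mask each other.
    """
    procs = sorted(proc_to_data)
    for k in range(1, t + 1):
        for faulty in combinations(procs, k):
            # Data elements that this fault pattern could corrupt.
            reachable = set().union(*(proc_to_data[p] for p in faulty))
            for errors in nonempty_subsets(reachable):
                err = set(errors)
                fired = any(
                    1 <= len(err & set(covered)) <= h
                    for covered in check_to_data.values()
                )
                if not fired:
                    return False  # an error pattern escapes every check
    return True


# Two processors, each producing two data elements.
proc_to_data = {"p0": {"d0", "d1"}, "p1": {"d2", "d3"}}

# One global check versus additional interleaved checks.
single_check = {"c0": {"d0", "d1", "d2", "d3"}}
more_checks = {
    "c0": {"d0", "d1", "d2", "d3"},
    "c1": {"d0", "d2"},
    "c2": {"d1", "d3"},
}

one_check_ok = is_t_fault_detecting(proc_to_data, single_check, t=1)
three_checks_ok = is_t_fault_detecting(proc_to_data, more_checks, t=1)
two_faults_ok = is_t_fault_detecting(proc_to_data, more_checks, t=2)
```

In this toy instance the single global check fails even for one fault, because a faulty processor can corrupt both of its data elements. Adding the two interleaved checks makes the system 1-fault detecting under the stated assumption, but not 2-fault detecting. The enumeration is exponential in the number of data elements, so it is only suitable for very small graphs.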