Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.