Location of a Faulty Module in a Computing System

Authors:
Tein-Hsiang Lin;Kang G. Shin
Affiliations:
State Univ. of New York, Buffalo;Univ. of Michigan, Ann Arbor
Venue:
IEEE Transactions on Computers
Year:
1990

Abstract

Considering the interplay between different phases of fault tolerance, a new problem of locating a faulty module in a computing system is formulated and solved. First, the probability of each module being faulty, or faulty probability, is calculated using the likelihood principle from the model parameters for fault detection, diagnostics, error propagation, and error detection. Then, based on the faulty probabilities and a given required diagnostic coverage, the order in which modules are to be diagnosed and the maximum time allotted to diagnose each module are determined by minimizing the average total diagnostic time. An example is presented and analyzed to answer the question of whether or not a system should delay the diagnosis upon detection of an error until more errors are detected.
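
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: it assumes exactly one faulty module, a diagnostic that always finds the fault within a fixed time per module (perfect coverage, no time limit to choose), and it uses a plain Bayes update in place of the paper's likelihood computation. Under those assumptions, diagnosing modules in decreasing order of probability-to-time ratio minimizes the average total diagnostic time. All function names, parameters, and numbers below are hypothetical.

```python
from itertools import permutations


def faulty_probabilities(priors, likelihoods):
    """Posterior probability that each module is the faulty one.

    Assumption: exactly one module is faulty. priors[i] is the prior fault
    probability of module i; likelihoods[i] is the probability of the observed
    error-detection pattern given that module i is faulty.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_time(order, probs, times):
    """Average total diagnostic time when modules are diagnosed in `order`
    and diagnosis stops as soon as the faulty module is found."""
    elapsed = 0.0
    expected = 0.0
    for i in order:
        elapsed += times[i]             # time spent up to and including module i
        expected += probs[i] * elapsed  # weighted by the chance the fault is in i
    return expected


def best_order(probs, times):
    """Diagnose modules in decreasing order of probability-to-time ratio."""
    return sorted(range(len(probs)), key=lambda i: probs[i] / times[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical numbers for a three-module system.
    probs = faulty_probabilities(priors=[0.2, 0.5, 0.3], likelihoods=[0.9, 0.1, 0.4])
    times = [4.0, 1.0, 2.0]
    order = best_order(probs, times)

    # Brute-force check over all orders on this small instance.
    brute = min(permutations(range(3)), key=lambda o: expected_time(o, probs, times))
    print("greedy order:", order, "expected time:", expected_time(order, probs, times))
    print("brute-force order:", list(brute))
```

The paper's formulation additionally accounts for a required diagnostic coverage and chooses the maximum time allotted to each module, which this sketch leaves out.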