A Bayesian approach to fault classification

Authors:
Tein-Hsiang Lin;Kang G. Shin
Affiliations:
Department of Electrical and Computer Engineering, State University of New York at Buffalo, Buffalo, New York;Real-Time Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan
Venue:
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Year:
1990

Abstract

According to their temporal behavior, faults in computer systems are classified into permanent, intermittent, and transient faults. Since it is impossible to identify the type of a fault upon its first detection, the common practice is to retry the failed instruction one or more times and then use other fault recovery methods, such as rollback or restart, if the retry is not successful. To determine an “optimal” (in some sense) number of retries, we need to know several fault parameters, which can be estimated only after classifying all the faults detected in the past.

In this paper we propose a new fault classification scheme which assigns a fault type to each detected fault based on its detection time, the outcome of retry, and its detection symptom. This classification procedure uses Bayesian decision theory to sequentially update the estimates of the fault parameters whenever a detected fault is classified. An important advantage of this classification is the early identification of the presence of an intermittent fault, so that appropriate measures can be taken before it causes serious damage to the system. To assess the quality of the proposed scheme, the probability of incorrect classification is also analyzed and compared with simulation results.
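The general idea of Bayesian fault classification with sequential updating can be illustrated with a small sketch. This is not the authors' scheme: the abstract does not give the model, so everything below is an assumption made for illustration. In particular, the sketch uses only the retry outcome (not detection time or symptom), the per-retry success probabilities in `RETRY_SUCCESS_PROB` are invented values, and the class `FaultClassifier` with its pseudo-count update is a hypothetical stand-in for the paper's parameter estimation.

```python
from dataclasses import dataclass, field

FAULT_TYPES = ("permanent", "intermittent", "transient")

# Hypothetical probability that a single retry succeeds, per fault type.
# Illustrative values only; they are not taken from the paper.
RETRY_SUCCESS_PROB = {"permanent": 0.0, "intermittent": 0.5, "transient": 0.9}


@dataclass
class FaultClassifier:
    # Pseudo-counts of past classifications, used as a simple prior
    # over fault types (assumption: start from a uniform prior).
    counts: dict = field(default_factory=lambda: {t: 1.0 for t in FAULT_TYPES})

    def prior(self):
        """Current prior P(type), derived from the pseudo-counts."""
        total = sum(self.counts.values())
        return {t: c / total for t, c in self.counts.items()}

    def posterior(self, failed_retries, succeeded):
        """Bayes' rule: P(type | outcome) is proportional to
        P(outcome | type) * P(type)."""
        prior = self.prior()
        unnormalized = {}
        for t in FAULT_TYPES:
            p = RETRY_SUCCESS_PROB[t]
            # Likelihood of `failed_retries` failures, optionally followed
            # by one successful retry.
            likelihood = (1.0 - p) ** failed_retries * (p if succeeded else 1.0)
            unnormalized[t] = likelihood * prior[t]
        z = sum(unnormalized.values())
        return {t: v / z for t, v in unnormalized.items()}

    def classify(self, failed_retries, succeeded):
        """Pick the most probable type, then update the prior with it."""
        post = self.posterior(failed_retries, succeeded)
        label = max(post, key=post.get)
        self.counts[label] += 1.0  # sequential update after each classification
        return label, post


if __name__ == "__main__":
    clf = FaultClassifier()
    # Three retries, all failed: consistent with a permanent fault.
    print(clf.classify(failed_retries=3, succeeded=False))
    # First retry succeeded: a permanent fault is ruled out.
    print(clf.classify(failed_retries=0, succeeded=True))
```

Each call to `classify` changes the prior used for the next detected fault, which mirrors the sequential flavor described in the abstract. A fuller model would also condition on detection time and detection symptom.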