An Optimal Retry Policy Based on Fault Classification

Authors:
Tein-Hsiang Lin;K. G. Shin
Venue:
IEEE Transactions on Computers
Year:
1994



Abstract

An optimal retry policy in a computer system is usually derived under the unrealistic assumption that fault characteristics are known a priori and remain unchanged throughout the mission lifetime. In such a case, the optimal retry period depends only upon the system's status at the time of fault detection. We propose to remedy this deficiency by formulating the optimal retry problem as a Bayesian decision problem in which not only the time of fault detection but also the results of earlier retries are used to estimate the current fault characteristics. Previous knowledge about fault characteristics is represented by the prior distributions of fault-related parameters, which are updated whenever new samples are obtained from the retry and detection mechanisms. A new fault classification scheme is proposed to assign a temporal fault type (i.e., permanent, intermittent, or transient) to each detected fault so that the corresponding fault parameters can be estimated. The estimated fault parameters are then used to derive the optimal retry period that minimizes the mean task completion time. Efficient algorithms are developed to determine the optimal retry period online upon detection of each fault. The proposed retry policy is compared with a number of fixed-retry-period policies and is found to outperform them in every case.
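
The general idea of classifying a fault with Bayes' rule and then choosing a retry period can be illustrated with a small sketch. This is not the authors' algorithm: the exponential fault-duration model, the rates in `RATES`, the fixed `recovery_cost` charged when a retry gives up, and all function names are assumptions made only for illustration. The sketch updates a belief over the three fault types after an unsuccessful retry and then searches a grid for the retry period with the lowest expected cost.

```python
import math

# Hypothetical rates (per time unit) at which an active fault disappears.
# A permanent fault never disappears, so its rate is 0. Values are illustrative.
RATES = {"permanent": 0.0, "intermittent": 0.5, "transient": 5.0}


def survival(fault_type, t):
    """Probability that a fault of this type is still active after time t
    (assumed exponential duration; not taken from the paper)."""
    return math.exp(-RATES[fault_type] * t)


def posterior_after_failed_retry(prior, t):
    """Bayes update of the fault-type belief, given that a retry of
    length t did not succeed (the fault was still active at time t)."""
    weights = {k: p * survival(k, t) for k, p in prior.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def expected_cost(belief, r, recovery_cost):
    """Expected time spent if we retry for at most r, then fall back to a
    recovery action costing recovery_cost when the fault is still active."""
    cost = 0.0
    for k, p in belief.items():
        mu = RATES[k]
        # E[min(X, r)] for exponential X with rate mu; equals r when mu == 0.
        mean_retry = r if mu == 0.0 else (1.0 - math.exp(-mu * r)) / mu
        cost += p * (mean_retry + recovery_cost * survival(k, r))
    return cost


def best_retry_period(belief, recovery_cost, r_max=5.0, steps=500):
    """Grid search for the retry period in [0, r_max] with minimal expected cost."""
    grid = [r_max * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda r: expected_cost(belief, r, recovery_cost))


if __name__ == "__main__":
    prior = {"permanent": 0.1, "intermittent": 0.3, "transient": 0.6}
    belief = posterior_after_failed_retry(prior, t=1.0)
    print(belief)
    print(best_retry_period(belief, recovery_cost=10.0))
```

Under these assumptions, each failed retry shifts the belief toward the permanent type, which in turn shortens the retry period worth attempting. A grid search is used here only for simplicity; it does not reflect the efficient online algorithms the abstract mentions.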