On switching policies for modular redundancy fault-tolerant computing systems

Authors:
M. Berg;I. Koren
Venue:
IEEE Transactions on Computers
Year:
1987

Abstract

The objective of a fault-tolerant computing system is to provide error-free operation in the presence of faults. The system must recover from the effects of a fault by employing recovery procedures such as program rollback, reload, and restart. These recovery procedures, however, interrupt the system's operation and thus reduce its availability for user applications. Fault-tolerant systems for critical applications therefore include standby spares that are ready to replace active modules that fail to recover from the effects of a fault. To increase the availability of the system for user applications, a standby spare may also be used to replace a module whose frequent faults cause too many repetitions of the recovery process. In this case a module switching policy is needed that indicates, upon a fault occurrence, whether to retry the failing module or to switch it out and replace it with a spare, considering the remaining mission time and the probability of a system crash. This paper presents a module switching policy for dynamic redundancy systems and illustrates the improvement in application-oriented availability that results from its use.
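
The retry-or-switch decision described in the abstract can be illustrated with a toy sketch. This is not the policy derived in the paper: it assumes a constant fault rate per module, a fixed time lost per recovery, and a one-time switching delay, and it ignores the probability of a system crash entirely. The names `Module`, `expected_downtime`, `should_switch`, and all parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical module description (not taken from the paper)."""
    fault_rate: float     # assumed constant rate, faults per hour
    recovery_time: float  # hours lost per recovery (rollback, reload, restart)


def expected_downtime(module: Module, remaining_time: float) -> float:
    """Expected hours lost to recovery over the remaining mission time.

    Approximation: expected fault count (rate * time) times the time per recovery.
    """
    return module.fault_rate * remaining_time * module.recovery_time


def should_switch(active: Module, spare: Module,
                  remaining_time: float, switch_time: float) -> bool:
    """Return True if switching to the spare has lower expected downtime.

    Toy rule only: the crash probability mentioned in the abstract is not modeled.
    """
    keep_cost = expected_downtime(active, remaining_time)
    # After switching, the spare runs for whatever mission time is left.
    time_on_spare = max(remaining_time - switch_time, 0.0)
    switch_cost = switch_time + expected_downtime(spare, time_on_spare)
    return switch_cost < keep_cost


faulty = Module(fault_rate=0.5, recovery_time=0.1)
spare = Module(fault_rate=0.01, recovery_time=0.1)

long_mission = should_switch(faulty, spare, remaining_time=100.0, switch_time=0.5)
short_mission = should_switch(faulty, spare, remaining_time=1.0, switch_time=0.5)
```

Under these assumed numbers, the sketch favors switching when much of the mission remains and favors retrying near the end of the mission, when the switching delay outweighs the downtime it would save.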