The objective of a fault-tolerant computing system is to provide error-free operation in the presence of faults. The system must recover from the effects of a fault by employing recovery procedures such as program rollback, reload, and restart. These recovery procedures, however, interrupt the system's operation and thus reduce the availability of the system for user applications. Fault-tolerant systems for critical applications therefore include standby spares that are ready to replace active modules which fail to recover from the effects of a fault. A standby spare may also be used to replace a module that suffers from frequent faults and hence too many repetitions of the recovery process, in order to increase the availability of the system for user applications. In this case, a module switching policy is needed that indicates, upon a fault occurrence, whether to retry the failing module or to switch it out and replace it with a spare, considering the remaining mission time and the probability of a system crash. This paper presents a module switching policy for dynamic redundancy systems and illustrates the improvement in application-oriented availability that results from its use.