A Time Redundancy Approach to TMR Failures Using Fault-State Likelihoods

Authors:
K. G. Shin; Hagbae Kim
Venue:
IEEE Transactions on Computers
Year:
1994

Abstract

A TMR failure, the failure to establish a majority among the processing modules of a triple modular redundant (TMR) system, is detected by using two voters and a disagreement detector. Assuming that no more than one module becomes permanently faulty during the execution of a task, Re-execution of the task on the Same HardWare (RSHW) upon detection of a TMR failure is a cost-effective recovery method, because 1) the TMR system can mask the effects of one faulty module while RSHW can recover from nonpermanent faults, and 2) system reconfiguration, that is, Replace the faulty HardWare, reload, and Restart (RHWR), is expensive in both time and hardware. We propose an adaptive recovery method for TMR failures that "optimally" chooses either RSHW or RHWR based on estimates of the costs involved. We apply Bayes' theorem to update the likelihoods of all possible states of the TMR system with each voting result. Upon detection of a TMR failure, the expected cost of RSHW is derived from these likelihoods and compared with that of RHWR. RSHW is repeated until either it recovers from the TMR failure or its expected cost exceeds that of RHWR. As the number of unsuccessful RSHW attempts increases, the probability that permanent fault(s) caused the TMR failure increases, which in turn increases the expected cost of RSHW. Our simulation results show that the proposed method outperforms the conventional reconfiguration method, which uses only RHWR, under various conditions.
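
The retry-or-reconfigure decision described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's model: the two-state fault space (`transient`, `permanent`), the per-state retry success probabilities, the cost values, and the one-step-lookahead cost formula are all assumptions made here for illustration, and the function names are hypothetical.

```python
from typing import Dict

Belief = Dict[str, float]


def bayes_update(prior: Belief, likelihood: Dict[str, float]) -> Belief:
    """Return P(state | observation) given P(state) and P(observation | state)."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return {s: v / total for s, v in unnormalized.items()}


def expected_rshw_cost(belief: Belief, p_success: Dict[str, float],
                       retry_cost: float, rhwr_cost: float) -> float:
    """Assumed one-step lookahead: pay for one retry, then fall back to
    RHWR if that retry also ends in a TMR failure."""
    p_ok = sum(belief[s] * p_success[s] for s in belief)
    return retry_cost + (1.0 - p_ok) * rhwr_cost


def retries_before_rhwr(prior: Belief, p_success: Dict[str, float],
                        retry_cost: float, rhwr_cost: float,
                        max_retries: int = 100) -> int:
    """Count how many RSHW attempts the policy makes when every attempt
    fails, before the expected RSHW cost reaches the RHWR cost."""
    belief = dict(prior)
    # Probability of observing another TMR failure in each state.
    p_fail = {s: 1.0 - p_success[s] for s in p_success}
    retries = 0
    while retries < max_retries:
        if expected_rshw_cost(belief, p_success, retry_cost, rhwr_cost) >= rhwr_cost:
            break  # reconfiguration is now the cheaper option
        retries += 1
        # The retry failed: shift belief toward the permanent-fault state.
        belief = bayes_update(belief, p_fail)
    return retries


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the paper.
    prior = {"transient": 0.9, "permanent": 0.1}
    p_success = {"transient": 0.8, "permanent": 0.0}
    print(retries_before_rhwr(prior, p_success, retry_cost=1.0, rhwr_cost=5.0))
```

Under these assumptions, each failed retry raises the posterior probability of a permanent fault, so the expected cost of another retry grows until reconfiguration becomes cheaper, which mirrors the stopping rule stated in the abstract.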