Derivation and Calibration of a Transient Error Reliability Model

Authors:
X. Castillo;S. R. McConnel;D. P. Siewiorek
Affiliations:
Asociacion I.T.P.;-;-
Venue:
IEEE Transactions on Computers
Year:
1982

Abstract

This paper presents a new modeling methodology for characterizing failure processes in digital computers due to hardware transients. The basic assumption is that system sensitivity to hardware transient errors is a function of critical resource usage. The failure rate of a given resource is approximated by a deterministic function of time, depending on the average workload of that resource, plus a Gaussian process. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, which explains why decreasing hazard function densities such as the Weibull fit experimental data so well. Data on transient errors obtained from several systems are analyzed. Statistical tests confirm the good fit between decreasing hazard distributions and actual data. Finally, models of common fault-tolerant redundant structures are developed using decreasing hazard function distributions. The analysis indicates significant differences between reliability predictions based on the exponential distribution and those based on decreasing hazard function distributions: model results show reliability differences of 0.2 and factors greater than 2 in Mission Time Improvement. System designers should be aware of these differences.
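
To make the contrast between exponential and decreasing hazard predictions concrete, the following is a minimal sketch, not the paper's model. It assumes the common two-parameter Weibull form R(t) = exp(-(lam*t)^alpha), whose hazard decreases when alpha < 1, and a triple modular redundant (TMR) structure with a perfect voter and independent modules. The parameter values and all function names are hypothetical, chosen only for illustration. The exponential comparison is matched to the Weibull by equal mean time to failure, which is one possible matching choice among several.

```python
import math


def weibull_reliability(t, lam, alpha):
    """Survival probability R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))


def weibull_hazard(t, lam, alpha):
    """Hazard h(t) = alpha * lam * (lam * t) ** (alpha - 1), for t > 0.

    The hazard decreases with t when alpha < 1 and is constant when alpha == 1.
    """
    return alpha * lam * (lam * t) ** (alpha - 1)


def weibull_mean(lam, alpha):
    """Mean time to failure of the Weibull: Gamma(1 + 1/alpha) / lam."""
    return math.gamma(1.0 + 1.0 / alpha) / lam


def tmr_reliability(r):
    """Reliability of a 2-out-of-3 majority system with module reliability r.

    Assumes independent modules and a perfect voter: 3r^2 - 2r^3.
    """
    return 3.0 * r ** 2 - 2.0 * r ** 3


def compare(t, lam, alpha):
    """Return TMR reliability at time t under (Weibull, exponential) modules.

    The exponential rate is chosen so both distributions share the same mean.
    """
    exp_rate = 1.0 / weibull_mean(lam, alpha)
    r_weibull = weibull_reliability(t, lam, alpha)
    r_exponential = math.exp(-exp_rate * t)
    return tmr_reliability(r_weibull), tmr_reliability(r_exponential)


if __name__ == "__main__":
    LAM, ALPHA = 1.0, 0.5  # hypothetical values, not taken from the paper
    for t in (0.1, 0.5, 1.0, 2.0):
        tmr_w, tmr_e = compare(t, LAM, ALPHA)
        print(
            f"t={t:4.1f}  hazard={weibull_hazard(t, LAM, ALPHA):.3f}  "
            f"TMR(Weibull)={tmr_w:.3f}  TMR(exponential)={tmr_e:.3f}"
        )
```

Setting alpha to 1 reduces the Weibull to the exponential, so both columns coincide; lowering alpha below 1 shows how the two predictions diverge over the mission time.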