The evolution of the DECsystem 10
Communications of the ACM - Special issue on computer architecture
Communications of the ACM
Reliability modeling techniques for self-repairing computer systems
ACM '69 Proceedings of the 1969 24th national conference
The switching structure and addressing architecture of an extensible multiprocessor: Cm*
Analysis and modeling of transient errors in digital computers
A compatible hardware/software reliability prediction model
The Multics system: an examination of its structure
Queueing Systems, Volume 1: Theory
Cm*: a modular, multi-microprocessor
AFIPS '77 Proceedings of the June 13-16, 1977, national computer conference
VAX/VMS Event Monitoring and Analysis
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
A Reliable OS Kernel for Smart Sensors
COMPSAC '04 Proceedings of the 28th Annual International Computer Software and Applications Conference - Volume 01
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
The Role of a Maintenance Processor for a General-Purpose Computer System
IEEE Transactions on Computers
Design and Application of Self-Testing Comparators Implemented with MOS PLA's
IEEE Transactions on Computers
Exact Fault-Sensitive Feasibility Analysis of Real-Time Tasks
IEEE Transactions on Computers
Enhanced reliability-aware power management through shared recovery technique
Proceedings of the 2009 International Conference on Computer-Aided Design
Reliability-aware dynamic energy management in dependable embedded real-time systems
ACM Transactions on Embedded Computing Systems (TECS)
A model for space-correlated failures in large-scale distributed systems
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
A reliable task scheduling scheme for sensor-based real-time operating system
ACOS'06 Proceedings of the 5th WSEAS international conference on Applied computer science
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Generalized reliability-oriented energy management for real-time embedded applications
Proceedings of the 48th Design Automation Conference
Reducing total energy for reliability-aware DVS algorithms
UIC'11 Proceedings of the 8th international conference on Ubiquitous intelligence and computing
Energy efficient configuration for qos in reliable parallel servers
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Journal of Systems and Software
This paper presents a new modeling methodology for characterizing failure processes in digital computers caused by hardware transients. The basic assumption is that a system's sensitivity to hardware transient errors is a function of its usage of critical resources. The failure rate of a given resource is approximated by a deterministic function of time, which depends on the average workload of that resource, plus a Gaussian process. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, which explains why decreasing-hazard densities such as the Weibull fit experimental data so well. Data on transient errors from several systems are analyzed, and statistical tests confirm the good fit between decreasing-hazard distributions and the observed data. Finally, models of common fault-tolerant redundant structures are developed using decreasing-hazard distributions. The analysis indicates significant differences between reliability predictions based on the exponential distribution and those based on decreasing-hazard distributions: the model results show reliability differences of 0.2 and Mission Time Improvement factors greater than 2. System designers should be aware of these differences.
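The contrast the abstract draws between exponential and decreasing-hazard reliability predictions can be illustrated with a small sketch. It is not the paper's model: it assumes a Weibull distribution with shape parameter below 1 (which gives a decreasing hazard), matches its mean time to failure to that of an exponential distribution, and uses triple modular redundancy (TMR) with a perfect voter as one example of a redundant structure. The function names and the values `mttf = 1000.0` and `shape = 0.5` are hypothetical choices for illustration only.

```python
import math


def exponential_reliability(t, rate):
    """R(t) = exp(-rate * t): constant hazard equal to `rate`."""
    return math.exp(-rate * t)


def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, scale, shape):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1); decreasing in t when shape < 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_scale_for_mttf(mttf, shape):
    """Scale that gives the requested mean, using mean = scale * Gamma(1 + 1/shape)."""
    return mttf / math.gamma(1.0 + 1.0 / shape)


def tmr_reliability(r):
    """2-out-of-3 majority system with a perfect voter: 3r^2 - 2r^3."""
    return 3 * r ** 2 - 2 * r ** 3


if __name__ == "__main__":
    mttf = 1000.0   # assumed module mean time to failure (arbitrary time units)
    shape = 0.5     # assumed Weibull shape; any value below 1 gives a decreasing hazard
    scale = weibull_scale_for_mttf(mttf, shape)

    print("t, R_exp, R_weibull, TMR_exp, TMR_weibull")
    for t in (10.0, 100.0, 500.0, 1000.0):
        r_exp = exponential_reliability(t, 1.0 / mttf)
        r_wei = weibull_reliability(t, scale, shape)
        print(t, r_exp, r_wei, tmr_reliability(r_exp), tmr_reliability(r_wei))
```

Both distributions share the same mean time to failure here, so any gap between the printed columns comes from the shape of the hazard function alone, which is the kind of difference the abstract asks designers to keep in mind.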