A Statistical Failure/Load Relationship: Results of a Multicomputer Study
IEEE Transactions on Computers
This paper proposes and validates a methodology for explicitly measuring the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU-related error to the system activity just prior to the error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU-related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal), since the slight gain in performance is more than offset by the degradation in reliability. Importantly, they also indicate that conventional reliability models need to be reevaluated so as to take system workload explicitly into account.
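As a rough illustration of relating errors to the load just before they occur, the sketch below estimates an empirical per-bin error fraction from logged data. It is not the paper's estimator: the function name `empirical_load_hazard`, the integer percent-utilization inputs, the fixed-width binning, and the toy data are all assumptions made for this example. It further assumes that every error is tagged with the load of the sampling interval immediately preceding it, and that this interval also appears in the list of load samples.

```python
from collections import Counter


def empirical_load_hazard(load_samples, error_loads, bin_width=10):
    """Return {bin_start: fraction of sampled intervals in that bin with an error}.

    load_samples: CPU utilization (integer percent) of every sampled interval.
    error_loads:  utilization of the interval just before each logged error
                  (assumed to be a subset of the sampled intervals).
    bin_width:    width of each load bin in percentage points (hypothetical choice).
    """
    # Count how often the system was observed in each load bin (exposure).
    exposure = Counter(x // bin_width for x in load_samples)
    # Count how many errors were preceded by a load in each bin.
    errors = Counter(x // bin_width for x in error_loads)
    # Per-bin ratio of errors to observations; bins never observed are omitted.
    return {
        b * bin_width: errors.get(b, 0) / exposure[b]
        for b in sorted(exposure)
    }


# Toy, made-up data: errors cluster at high utilization.
samples = [10, 15, 20, 25, 90, 95, 95, 99]
error_at = [95, 99]
hazard = empirical_load_hazard(samples, error_at, bin_width=50)
```

Plotting such per-bin fractions against load is one simple way to check whether the error risk grows faster than linearly near the high-utilization end, which is the behavior the abstract reports.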