A measurement-based model for workload dependence of CPU errors

Authors:
Ravishankar K. Iyer; David J. Rossetti
Affiliations:
Univ. of Illinois, Urbana, IL; Metaphor Computer Systems, Mountain View, CA
Venue:
IEEE Transactions on Computers - The MIT Press scientific computation series
Year:
1986

Citing 3
Cited 23

A compatible hardware/software reliability prediction model
A Statistical Failure/Load Relationship: Results of a Multicomputer Study

IEEE Transactions on Computers
Effects and detection of intermittent failures in digital systems

AFIPS '69 (Fall) Proceedings of the November 18-20, 1969, fall joint computer conference

Measurement-Based Analysis of Error Latency

IEEE Transactions on Computers
Fast timing simulation of transient faults in digital circuits

ICCAD '94 Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design
The Effect of Program Behavior on Fault Observability

IEEE Transactions on Computers
A Methodology for the Rapid Injection of Transient Hardware Errors

IEEE Transactions on Computers
System Dependability Evaluation via a Fault List Generation Algorithm

IEEE Transactions on Computers
A Gate-Level Simulation Environment for Alpha-Particle-Induced Transient Faults

IEEE Transactions on Computers
Fault behavior dictionary for simulation of device-level transients

ICCAD '93 Proceedings of the 1993 IEEE/ACM international conference on Computer-aided design
Performance and dependability evaluation of scalable massively parallel computer systems with conjoint simulation

ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special issue on Web-based modeling and simulation
Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems

IEEE Transactions on Computers
Compiler-Assisted Multiple Instruction Word Retry for VLIW Architectures

IEEE Transactions on Parallel and Distributed Systems
A prototype of a VHDL-based fault injection tool: description and application

Journal of Systems Architecture: the EUROMICRO Journal - Defect and fault tolerance in VLSI Systems
A Reliability Testing Environment for Off-the-Shelf Memory Subsystems

IEEE Design & Test
FOCUS: An Experimental Environment for Fault Sensitivity Analysis

IEEE Transactions on Computers
Design and Analysis of an Integrated Checkpointing and Recovery Scheme for Distributed Applications

IEEE Transactions on Knowledge and Data Engineering
Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers

IEEE Transactions on Software Engineering
Fault Injection into VHDL Models: Experimental Validation of a Fault Tolerant Microcomputer System

EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
Experimental evaluation of the fail-silent behaviour in programs with consistency checks

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
FAMAS: FAult Modeling via Adaptive Simulation

VLSID '97 Proceedings of the Tenth International Conference on VLSI Design: VLSI in Multimedia Applications
A Model for the Analysis of the Fault Injection Process

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Reflections on Industry Trends and Experimental Research in Dependability

IEEE Transactions on Dependable and Secure Computing
An error recoverable structure based on complementary logic and alternating-retry

Journal of Computer Science and Technology
Exact Fault-Sensitive Feasibility Analysis of Real-Time Tasks

IEEE Transactions on Computers
Reliability-aware dynamic energy management in dependable embedded real-time systems

ACM Transactions on Embedded Computing Systems (TECS)

Abstract

This paper proposes and validates a methodology for explicitly measuring the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU-related error to the system activity just prior to the error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU-related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal), since the slight gain in performance is more than offset by the degradation in reliability. Importantly, they also indicate that conventional reliability models need to be reevaluated so as to take system workload explicitly into account.
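
As a rough illustration of the idea in the abstract, the sketch below estimates an empirical load hazard by binning workload observations and computing, per bin, the fraction of observation intervals that were followed by a CPU-related error. It is not the authors' estimator: the function name `load_hazard`, the `(workload, had_error)` record layout, the equal-width binning, and the sample data are all assumptions made for this example.

```python
def load_hazard(samples, n_bins=4):
    """Estimate P(error | workload bin) from (workload, had_error) pairs.

    Assumed inputs (hypothetical layout, not from the paper):
      workload  - a fraction in [0, 1], e.g. CPU utilization measured
                  over the interval just before the error check
      had_error - True if a CPU-related error was logged after it
    Returns one value per bin; None marks a bin with no observations.
    """
    totals = [0] * n_bins
    errors = [0] * n_bins
    for workload, had_error in samples:
        # Map the workload to an equal-width bin; 1.0 goes in the last bin.
        idx = min(int(workload * n_bins), n_bins - 1)
        totals[idx] += 1
        errors[idx] += int(had_error)
    return [errors[i] / totals[i] if totals[i] else None
            for i in range(n_bins)]


# Made-up observations, for illustration only.
samples = [
    (0.1, False), (0.1, False),
    (0.3, False), (0.3, False),
    (0.6, False), (0.6, True), (0.6, False), (0.6, False),
    (0.9, True), (0.9, True), (0.9, False), (0.9, False),
]
hazard = load_hazard(samples, n_bins=4)
```

With this made-up data the per-bin estimate rises from 0.0 in the two lowest workload bins to 0.25 and then 0.5, which is the kind of upward trend the abstract describes. Real measurements would need many more intervals per bin, and the same binning could be applied to the I/O rate or the interrupt rate in place of CPU utilization.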