Measurement-Based Analysis of Error Latency

Authors:
R. Chillarege;R. K. Iyer
Affiliations:
IBM Thomas J. Watson Research Center;Univ. of Illinois at Urbana-Champaign, Urbana
Venue:
IEEE Transactions on Computers
Year:
1987

Abstract

This paper demonstrates a practical methodology for the study of error latency under a real workload. The method is illustrated with sampled data on the physical memory activity, gathered by hardware instrumentation on a VAX 11/780 during the normal workload cycle of the installation. These data are used to simulate fault occurrence and to reconstruct the error discovery process in the system. The technique provides a means to study the system under different workloads and for multiple days. An approach to determine the percentage of undiscovered errors is also developed and a verification of the entire methodology is performed. This study finds that the mean error latency, in the memory containing the operating system, varies by a factor of 10 to 1 (in hours) between the low and high workloads. It is found that of all errors occurring within a day, 70 percent are detected in the same day, 82 percent within the following day, and 91 percent within the third day. The increase in failure rate due to latency is not so much a function of remaining errors but is dependent on whether or not there is a latent error.
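The reconstruction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical, sorted list of access timestamps (in hours) for one memory region, and it assumes that a simulated fault is discovered at the first access at or after the time it occurs. The names `error_latency` and `summarize` are illustrative only.

```python
from bisect import bisect_left


def error_latency(access_times, fault_time):
    """Delay from fault_time to the first access at or after it.

    access_times must be sorted in ascending order (assumed input format).
    Returns None if the region is never accessed again, i.e. the
    simulated error stays undiscovered within the observed trace.
    """
    i = bisect_left(access_times, fault_time)
    if i == len(access_times):
        return None
    return access_times[i] - fault_time


def summarize(access_times, fault_times):
    """Mean latency of discovered errors and fraction left undiscovered."""
    latencies = [error_latency(access_times, t) for t in fault_times]
    found = [x for x in latencies if x is not None]
    mean = sum(found) / len(found) if found else None
    undiscovered = (
        (len(latencies) - len(found)) / len(latencies) if latencies else 0.0
    )
    return mean, undiscovered


if __name__ == "__main__":
    # Hypothetical trace: the region is touched at hours 2, 5 and 9.
    accesses = [2.0, 5.0, 9.0]
    # Hypothetical simulated fault times.
    faults = [1.0, 3.0, 9.5]
    print(summarize(accesses, faults))
```

Running the same summary over traces sampled under low and high workloads would be one way to compare mean latency across workloads, which is the kind of comparison the abstract reports.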