The Effect of Program Behavior on Fault Observability

Authors: Nicolas S. Bowen; D. K. Pradhan
Affiliations: IBM T.J. Watson Research Center, Yorktown Heights, NY; Texas A&M Univ., College Station, TX
Venue: IEEE Transactions on Computers
Year: 1996

Abstract

This paper studies fault observability as a function of memory reference behavior. Traditional studies view memory as one monolithic entity that must work completely to be considered reliable; here, the emphasis is on how a particular program actually uses its memory. A new model for the successful execution of a program is developed by extending a cache memory performance model to account for how the program uses its data. Three variations are presented, based on well-known allocation schemes: the program's storage is preallocated, dynamically allocated, or constrained in allocation. The results are contrasted with traditional memory reliability calculations to show that the actual mean time to failure may be more optimistic than those calculations suggest when program behavior is considered. Expressions for the probability of unobserved faults are also developed. Several studies have reported correlations between increased workloads and increased failure rates, and a new theory is proposed here to explain this behavior. Applying the model to several program traces shows that increased workloads could raise observed failure rates by 32% to 53%.
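
The contrast between the monolithic view and a usage-aware view can be illustrated with a deliberately simplified sketch. This is not the paper's model, which extends a cache memory performance model. It assumes instead that every memory word fails independently at a constant rate, that a fault is observed only if it lies in a word the program references, and that faults land uniformly across words. All function names, parameters, and numbers below are hypothetical.

```python
def mttf_monolithic(total_words: int, fail_rate: float) -> float:
    """MTTF when any word failing counts as a memory failure.

    Assumes independent, exponentially distributed word lifetimes,
    so the per-word failure rates simply add up.
    """
    return 1.0 / (total_words * fail_rate)


def mttf_usage_aware(referenced_words: int, fail_rate: float) -> float:
    """MTTF when only faults in referenced words are observed.

    Assumption for illustration: the set of referenced words is fixed
    and faults outside it never affect the program.
    """
    return 1.0 / (referenced_words * fail_rate)


def prob_fault_unobserved(total_words: int, referenced_words: int) -> float:
    """Probability that a single fault, placed uniformly at random,
    lands in a word the program never references."""
    return 1.0 - referenced_words / total_words


if __name__ == "__main__":
    total = 1_000_000        # hypothetical memory size in words
    referenced = 250_000     # hypothetical number of words the program touches
    rate = 1e-9              # hypothetical failures per word per hour

    print("monolithic MTTF (h): ", mttf_monolithic(total, rate))
    print("usage-aware MTTF (h):", mttf_usage_aware(referenced, rate))
    print("P(fault unobserved): ", prob_fault_unobserved(total, referenced))
```

Under these assumptions the usage-aware MTTF exceeds the monolithic one whenever the program references only part of memory, and a workload that references more words sees a proportionally higher observed failure rate. That matches the direction of the effects described in the abstract, though not their measured magnitudes.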