From defects to failures: a view of dependable computing

Authors:
Behrooz Parhami
Affiliations:
Dept. of Electrical and Computer Engineering, University of California, Santa Barbara, CA
Venue:
ACM SIGARCH Computer Architecture News - Special Issue: Architectural Support for Operating Systems
Year:
1988



Abstract

A unified framework and terminology for the study of computer system dependability is presented. Impairments to dependability are viewed from six abstraction levels. It is argued that all of these levels are useful, in the sense that proven dependability procurement techniques can be applied at each level, and that it is beneficial to have distinct, precisely defined terminology for describing impairments to and procurement strategies for computer system dependability at these levels. The six levels in the proposed framework are:

1. Defect level or component level, dealing with deviant atomic parts.
2. Fault level or logic level, dealing with deviant logic values or path selections.
3. Error level or information level, dealing with deviant internal states.
4. Malfunction level or system level, dealing with deviant functional behavior.
5. Degradation level or service level, dealing with deviant performance.
6. Failure level or result level, dealing with deviant outputs or actions.

Briefly, a hardware or software component may be defective (perfect hardware may also become defective due to wear and aging). Certain system states expose the defect, resulting in the development of a logic-level fault. Information flowing within a faulty system may become contaminated, leading to the presence of an error. An erroneous system state may result in a subsystem malfunction. Automatic or manually controlled reconfiguration can isolate or bypass the malfunctioning subsystem but may lead to a degraded performance or service. Serious performance degradation may lead to a result-level system failure when untrustworthy or untimely results are produced. Finally, a failed computer system can have adverse effects on the larger societal or corporate system into which it is incorporated.
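
The six-level taxonomy in the abstract can be pictured as an ordered data structure. The Python sketch below is an illustration only and is not taken from the paper: the names `Impairment`, `LEVELS`, `escalate`, and `propagation_path` are hypothetical. It also treats escalation as a strict step to the next level, whereas the abstract says only that each impairment "may" lead to the next, and that techniques applied at any level can stop the progression.

```python
from enum import IntEnum


class Impairment(IntEnum):
    """The six impairment levels, ordered from lowest to highest abstraction."""
    DEFECT = 1
    FAULT = 2
    ERROR = 3
    MALFUNCTION = 4
    DEGRADATION = 5
    FAILURE = 6


# Each impairment maps to (abstraction level, what is deviant at that level),
# using the wording of the abstract.
LEVELS = {
    Impairment.DEFECT: ("component", "atomic parts"),
    Impairment.FAULT: ("logic", "logic values or path selections"),
    Impairment.ERROR: ("information", "internal states"),
    Impairment.MALFUNCTION: ("system", "functional behavior"),
    Impairment.DEGRADATION: ("service", "performance"),
    Impairment.FAILURE: ("result", "outputs or actions"),
}


def escalate(level):
    """Return the next impairment in the chain, or None after FAILURE.

    Assumption: escalation is modeled as unconditional; in the paper's
    framework it is only a possibility at each step.
    """
    if level is Impairment.FAILURE:
        return None
    return Impairment(level + 1)


def propagation_path(start):
    """List the impairments an unchecked problem could pass through from `start`."""
    path = [start]
    nxt = escalate(start)
    while nxt is not None:
        path.append(nxt)
        nxt = escalate(nxt)
    return path


if __name__ == "__main__":
    for imp in propagation_path(Impairment.DEFECT):
        level, deviant = LEVELS[imp]
        print(f"{imp.name.title()} ({level} level): deviant {deviant}")
```

Using an ordered enumeration keeps the "lower level may lead to higher level" relationship explicit, so that a countermeasure at a given level can be expressed as cutting the path at that point.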