Measurement and modeling of computer reliability as affected by system activity

Authors:
R. K. Iyer; D. J. Rossetti; M. C. Hsueh
Affiliations:
Univ. of Illinois at Urbana-Champaign, Urbana; Stanford Univ., Stanford, CA; Univ. of Illinois at Urbana-Champaign, Urbana
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1986

Abstract

This paper demonstrates a practical approach to the study of the failure behavior of computer systems. Particular attention is devoted to the analysis of permanent failures. A number of important techniques, which may have general applicability in both failure and workload analysis, are brought together in this presentation. These include: smeared averaging of the workload data, clustering of like failures, and joint analysis of workload and failures. Approximately 17 percent of all failures affecting the CPU were estimated to be permanent. The manifestation of a permanent failure was found to be strongly correlated with the level and type of workload prior to the failure. Although, in strict terms, the results only relate to the manifestation of permanent failures and not to their occurrence, there are strong indications that permanent failures are both caused and discovered by increased activity. More measurements and experiments are necessary to determine their respective contributions to the measured workload/failure relationship.
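The abstract names its techniques without defining them, so the following is only a rough sketch under stated assumptions, not the authors' method. It stands in a trailing moving average for "smeared averaging" of workload samples and a simple time-gap grouping for "clustering of like failures", then reads off the smoothed workload just before each failure cluster as a toy form of joint analysis. All function names, the window and gap parameters, and the data are hypothetical.

```python
from statistics import mean


def trailing_average(samples, window):
    """Average each sample with up to window-1 preceding samples.

    Assumption: a trailing moving average is used here as a stand-in
    for the paper's "smeared averaging"; the abstract does not define it.
    """
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        smoothed.append(mean(samples[start:i + 1]))
    return smoothed


def cluster_failures(times, gap):
    """Group failure timestamps whose spacing is at most `gap`.

    Assumption: closeness in time is used as the notion of "like"
    failures; the paper may also use other attributes of each record.
    """
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


# Hypothetical data: one workload sample per time unit, failure timestamps
# given as indices into that series.
workload = [0.2, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.9]
failure_times = [3, 4, 7]

smoothed = trailing_average(workload, window=3)
clusters = cluster_failures(failure_times, gap=1)

# Joint view: smoothed workload one step before the first failure of
# each cluster, compared against the overall average workload.
pre_failure = [smoothed[c[0] - 1] for c in clusters]
overall = mean(smoothed)
```

Comparing `pre_failure` against `overall` on real logs would only indicate whether failures tend to manifest after elevated activity; as the abstract itself cautions, it would not separate failures caused by load from failures merely discovered under load.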