Performance considerations for the reliability analysis of computing systems.
Performance considerations for the reliability analysis of computing systems.
Measurement-Based Analysis of Error Latency
IEEE Transactions on Computers
Performance Modeling Based on Real Data: A Case Study
IEEE Transactions on Computers - Fault-Tolerant Computing
An Experimental Study of Memory Fault Latency
IEEE Transactions on Computers
Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
IEEE Transactions on Computers
Observations on the Effects of Fault Manifestation as a Function of Workload
IEEE Transactions on Computers - Special issue on fault-tolerant computing
MEASURE+: a measurement-based dependability analysis package
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
HEAT: hierarchical energy analysis tool
DAC '96 Proceedings of the 33rd annual Design Automation Conference
IEEE Transactions on Computers
Fault-Tolerant Rate-Monotonic Scheduling
Real-Time Systems
Stress-Based and Path-Based Fault Injection
IEEE Transactions on Computers
Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems
IEEE Transactions on Computers
Dependability Measurement and Modeling of a Multicomputer System
IEEE Transactions on Computers
The Effect of Workload on the Performance and Availability of Voting Algorithms
MASCOTS '95 Proceedings of the 3rd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
Measurement-based Analysis of Networked System Availability
Performance Evaluation: Origins and Directions
The Interplay of Power Management and Fault Recovery in Real-Time Systems
IEEE Transactions on Computers
Energy management for real-time embedded systems with reliability requirements
Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
ACM Transactions on Storage (TOS)
Exact Fault-Sensitive Feasibility Analysis of Real-Time Tasks
IEEE Transactions on Computers
Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Cluster Computing
Enhanced reliability-aware power management through shared recovery technique
Proceedings of the 2009 International Conference on Computer-Aided Design
An Analysis of Traces from a Production MapReduce Cluster
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Quantifying event correlations for proactive failure management in networked computing systems
Journal of Parallel and Distributed Computing
Reliability-aware dynamic energy management in dependable embedded real-time systems
ACM Transactions on Embedded Computing Systems (TECS)
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Reliability analysis reloaded: how will we survive?
Proceedings of the Conference on Design, Automation and Test in Europe
Journal of Systems and Software
Hi-index | 0.03 |
This paper demonstrates a practical approach to the study of the failure behavior of computer systems. Particular attention is devoted to the analysis of permanent failures. A number of important techniques, which may have general applicability in both failure and workload analysis, are brought together in this presentation. These include: smeared averaging of the workload data, clustering of like failures, and joint analysis of workload and failures. Approximately 17 percent of all failures affecting the CPU were estimated to be permanent. The manifestation of a permanent failure was found to be strongly correlated with the level and type of workload prior to the failure. Although, in strict terms, the results only relate to the manifestation of permanent failures and not to their occurrence, there are strong indications that permanent failures are both caused and discovered by increased activity. More measurements and experiments are necessary to determine their respective contributions to the measured workload/failure relationship.