Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data

Authors:
Ravishankar K. Iyer;Luke T. Young;P. V. Krishna Iyer
Affiliations:
Univ. of Illinois at Urbana Champaign, Urbana;Univ. of Illinois at Urbana Champaign, Urbana;-
Venue:
IEEE Transactions on Computers
Year:
1990

Abstract

A methodology is proposed for recognizing the symptoms of persistent problems in large systems. The system error rate is used to identify the error states among which relationships may exist. Statistical techniques are used to validate and quantify the strength of the relationship among these error states. As input, the approach takes the raw error logs containing a single entry for each error that is detected as an isolated event. As output, it produces a list of symptoms that characterize persistent errors. Thus, given a failure, it is determined whether the failure is an intermittent manifestation of a common fault or whether it is an isolated (transient) incident. The technique is shown to work on two CYBER systems and on an IBM 3081 multiprocessor system. Comparisons to real failure/repair information obtained from field engineers show that, in about 85% of the cases, the error symptoms recognized by this approach correspond to real problems. The remaining 15% of the cases, although not directly supported by field data, are confirmed as being valid problems.
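
The pipeline described in the abstract (raw log entries in, related error states out) can be illustrated with a minimal sketch. The abstract does not state which grouping rule or statistic the authors use, so everything below is an assumption made for illustration: entries are grouped into bursts by a time-gap threshold as a stand-in for the error-rate criterion, and the phi coefficient of a 2x2 contingency table is a stand-in for the statistical validation step. The names `group_by_gap`, `phi`, `related_states`, `max_gap`, and `min_phi` are hypothetical.

```python
from itertools import combinations


def group_by_gap(entries, max_gap):
    """Group (timestamp, error_state) entries into bursts.

    Assumption: a new burst starts whenever the gap to the previous
    entry exceeds max_gap. This is a stand-in for the paper's
    error-rate criterion, which the abstract does not specify.
    """
    bursts = []
    last_time = None
    for time, state in sorted(entries):
        if last_time is None or time - last_time > max_gap:
            bursts.append(set())
        bursts[-1].add(state)
        last_time = time
    return bursts


def phi(bursts, a, b):
    """Phi coefficient for 'a present' versus 'b present' across bursts.

    Assumption: used here only as one simple measure of association;
    the paper's actual statistic may differ.
    """
    n11 = sum(1 for s in bursts if a in s and b in s)
    n10 = sum(1 for s in bursts if a in s and b not in s)
    n01 = sum(1 for s in bursts if a not in s and b in s)
    n00 = len(bursts) - n11 - n10 - n01
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def related_states(entries, max_gap, min_phi):
    """Return pairs of error states whose association reaches min_phi."""
    bursts = group_by_gap(entries, max_gap)
    states = sorted(set().union(*bursts)) if bursts else []
    return [
        (a, b)
        for a, b in combinations(states, 2)
        if phi(bursts, a, b) >= min_phi
    ]
```

Under these assumptions, a pair of error states that repeatedly appears in the same bursts would be reported as a candidate symptom of a persistent problem, while a state that only appears alone would be treated as isolated. The thresholds are illustrative knobs, not values taken from the study.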