Measurement and modeling of computer reliability as affected by system activity
ACM Transactions on Computer Systems (TOCS)
Design and evaluation of an on-line predictive diagnostic system
Design and evaluation of an on-line predictive diagnostic system
Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults
IEEE Transactions on Computers
Diagnosing Rediscovered Software Problems Using Symptoms
IEEE Transactions on Software Engineering
Analysis and implementation of software rejuvenation in cluster systems
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Dependability Measurement and Modeling of a Multicomputer System
IEEE Transactions on Computers
Measurement-based Analysis of Networked System Availability
Performance Evaluation: Origins and Directions
A comparative analysis of event tupling schemes
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems
ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
Reflections on Industry Trends and Experimental Research in Dependability
IEEE Transactions on Dependable and Secure Computing
Effective Fault Treatment for Improving the Dependability of COTS and Legacy-Based Applications
IEEE Transactions on Dependable and Secure Computing
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters
IEEE Transactions on Dependable and Secure Computing
Understanding customer problem troubleshooting from storage system logs
FAST '09 Proccedings of the 7th conference on File and storage technologies
Proactive management of software aging
IBM Journal of Research and Development
EVEREST+: run-time SLA violations prediction
Proceedings of the 5th International Workshop on Middleware for Service Oriented Computing
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
OS-level hang detection in complex software systems
International Journal of Critical Computer-Based Systems
Operating system support to detect application hangs
VECoS'08 Proceedings of the Second international conference on Verification and Evaluation of Computer and Communication Systems
Fault prediction under the microscope: a closer look into HPC systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 14.99 |
A methodology is proposed for recognizing the symptoms of persistent problems in large systems. The system error rate is used to identify the error states among which relationships may exist. Statistical techniques are used to validate and quantify the strength of the relationship among these error states. As input, the approach takes the raw error logs containing a single entry for each error that is detected as an isolated event. As output, it produces a list of symptoms that characterize persistent errors. Thus, given a failure, it is determined whether the failure is an intermittent manifestation of a common fault or whether it is an isolated (transient) incident. The technique is shown to work on two CYBER systems and on IBM 3081 multiprocessor system. Comparisons to real failure/repair information obtained from field engineers show that, in about 85% of the cases, the error symptoms recognized by this approach correspond to real problems. The remaining 15% of the cases, although not directly supported by field data, are confirmed as being valid problems.