Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
IEEE Transactions on Computers
Event logs provide an effective means of improving system availability. However, a single fault typically produces many error entries, because faults propagate in both the time and error-detection domains. The ability to coalesce related events is therefore critical. The tupling heuristics developed at Carnegie Mellon University provide one such methodology. These heuristics were applied to a new and larger set of data in order to evaluate the generality of the scheme and to extend the previous work. The extensions include a semantic explanation of why the rules work, an expanded statistical analysis, and a comprehensive sensitivity study of the effects of changes in the rules. The results show that tupling is a useful and general methodology. The sensitivity study identified refinements to the rules, and the high degree of skew in the tuple variables suggests using their extreme percentiles as alarm thresholds for proactive fault management.
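The core of the tupling approach described above is time-based coalescing: error-log entries that occur close together are grouped into a single "tuple" representing one underlying fault. The sketch below illustrates that idea only; the function name and the fixed clustering window are illustrative assumptions, not the exact CMU heuristics, which also use error-detection-domain (device/source) rules.

```python
def coalesce(timestamps, window=300.0):
    """Group event timestamps into tuples of related events.

    An event is added to the current tuple if it occurs within
    `window` seconds of the previous event; otherwise a new tuple
    is started. The 5-minute default is a hypothetical choice.
    """
    tuples = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= window:
            tuples[-1].append(t)   # within the window: same fault
        else:
            tuples.append([t])     # large gap: start a new tuple
    return tuples

# Example: three bursts of error entries collapse into three tuples.
events = [0, 10, 40, 1000, 1005, 5000]
print(coalesce(events, window=300))
# -> [[0, 10, 40], [1000, 1005], [5000]]
```

In a proactive-management setting, tuple statistics (e.g., tuple size or duration) could then be monitored, with an alarm raised when a value exceeds an extreme percentile of its historical distribution, as the abstract proposes.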