Scoring and thresholding for availability

Authors:
S. Heisig; J. R. M. Hosking
Affiliations:
IBM Research Division, Thomas J. Watson Research Center, Hawthorne, NY; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY
Venue:
IBM Systems Journal
Year:
2008

Abstract

As the capacity of hardware systems has grown and workload consolidation has taken place, the volume of performance metrics and diagnostic data streams has outscaled the capability of people to handle these systems using traditional methods. As work of different types (such as database, batch, and Web processing), each in its own monitoring silo, runs concurrently on a single image (operating system instance), both the complexity and the business consequences of a single image failure have increased. This paper presents two techniques for generating actionable information out of the overwhelming amount of performance and diagnostic data available to human analysts. Failure scoring is used to identify high-risk failure events that may be obscured in the myriad system events. This replaces human expertise in scanning tens of thousands of records per day and results in a short, prioritized list for action by systems staff. Adaptive thresholding is used to drive predictive and descriptive machine-learning-based modeling to isolate and identify misbehaving processes and transactions. The attraction of this technique is that it does not require human intervention and can be reapplied continually, resulting in models that are not brittle. Both techniques reduce the quantity and increase the relevance of data available for programmatic and human processes.
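
The abstract does not say how the adaptive thresholds are computed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple scheme in which each new metric value is compared against an empirical quantile of the most recent observations, so the threshold is recomputed continually without human intervention. The function name `adaptive_flags` and the parameters `window`, `q`, and `min_history` are hypothetical.

```python
import math
from collections import deque


def adaptive_flags(values, window=50, q=0.99, min_history=10):
    """Flag values that exceed a rolling empirical quantile.

    Assumption (not from the paper): the threshold is the nearest-rank
    q-quantile of up to `window` preceding observations.
    """
    history = deque(maxlen=window)  # most recent observations only
    flags = []
    for v in values:
        if len(history) >= min_history:
            ordered = sorted(history)
            # Nearest-rank quantile: smallest value with at least a
            # fraction q of the history at or below it.
            rank = max(0, math.ceil(q * len(ordered)) - 1)
            threshold = ordered[rank]
            flags.append(v > threshold)
        else:
            # Too little history to set a threshold; do not flag.
            flags.append(False)
        history.append(v)  # the threshold adapts as new data arrives
    return flags


# Usage: a steady metric with a single spike at position 20.
metric = [10] * 20 + [100] + [10] * 5
flags = adaptive_flags(metric)
spike_positions = [i for i, f in enumerate(flags) if f]
```

Because the threshold is derived from recent data rather than fixed by hand, a sustained shift in the metric is flagged when it first appears and then absorbed into the baseline, which is one way to avoid the brittleness the abstract mentions.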