As hardware capacity has grown and workloads have been consolidated, the volume of performance metrics and diagnostic data streams has outgrown the ability of people to manage these systems with traditional methods. As work of different types (such as database, batch, and Web processing), each in its own monitoring silo, runs concurrently on a single image (operating system instance), both the complexity and the business consequences of a single image failure have increased. This paper presents two techniques for extracting actionable information from the overwhelming amount of performance and diagnostic data available to human analysts. Failure scoring identifies high-risk failure events that may be obscured among the myriad system events. It replaces the human effort of scanning tens of thousands of records per day and produces a short, prioritized list for action by systems staff. Adaptive thresholding drives predictive and descriptive machine-learning-based modeling to isolate and identify misbehaving processes and transactions. The attraction of this technique is that it requires no human intervention and can be reapplied continually, resulting in models that are not brittle. Both techniques reduce the quantity and increase the relevance of data available for programmatic and human processes.
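The abstract does not say how the adaptive threshold is computed, so the following is only a minimal sketch of the general idea, assuming a rolling empirical quantile over recent observations. The function name `adaptive_threshold_flags` and the parameters `window`, `q`, and `min_history` are hypothetical and are not taken from the paper.

```python
from collections import deque


def adaptive_threshold_flags(values, window=50, q=0.99, min_history=10):
    """Flag each value that exceeds a rolling empirical quantile.

    Illustrative assumption only: the threshold is the q-th quantile of the
    last `window` observations, recomputed at every step, so it adapts to
    the metric's recent behavior without manual tuning.
    """
    history = deque(maxlen=window)  # sliding window of recent observations
    flags = []
    for v in values:
        if len(history) >= min_history:
            ordered = sorted(history)
            # Index-based empirical quantile, clamped to the last element.
            idx = min(len(ordered) - 1, int(q * len(ordered)))
            threshold = ordered[idx]
            flags.append(v > threshold)
        else:
            # Too little history to form a threshold; do not flag.
            flags.append(False)
        history.append(v)
    return flags


# A flat metric with one spike: only the spike exceeds its own history.
metric = [10.0] * 20 + [100.0] + [10.0] * 5
flags = adaptive_threshold_flags(metric)
```

Because the threshold is recomputed from the sliding window on every observation, the same code can be reapplied continually as workloads change, which is the property the abstract highlights. In practice the flagged observations would serve as labels or inputs for the downstream modeling step rather than as final alerts.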