Ranking the importance of alerts for problem determination in large computer systems

Authors:
Guofei Jiang;Haifeng Chen;Kenji Yoshihira;Akhilesh Saxena
Affiliations:
NEC Laboratories America, Princeton, NJ, USA;NEC Laboratories America, Princeton, NJ, USA;NEC Laboratories America, Princeton, NJ, USA;NEC Laboratories America, Princeton, NJ, USA
Venue:
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Year:
2009

Abstract

The complexity of large computer systems has raised unprecedented challenges for system management. In practice, operators often collect large volumes of monitoring data from system components and set up many rules to check the data and trigger alerts. However, alerts from different rules usually differ in problem-reporting accuracy because their thresholds are often set manually, based on operators' experience and intuition. Meanwhile, because of system dependencies, a single problem may trigger many alerts at the same time in a large system, and the critical question is which alert should be analyzed first in the subsequent problem determination process. In this paper, we propose a novel peer review mechanism to rank the importance of alerts, so that the top-ranked alerts are more likely to be true positives. After comparing a metric value against its own threshold to generate an alert, we also compare the value with the equivalent thresholds from many other rules to determine the alert's importance. Our approach is evaluated on a real test bed system, and experimental results are included to demonstrate its effectiveness.
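
The peer review idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how equivalent thresholds are derived or how the comparisons are combined into a rank. The sketch assumes each alert already carries a hypothetical list of peer thresholds expressed on its own metric's scale, and it scores an alert by the fraction of those peer thresholds that its value also violates. The names Alert, peer_score, and rank_alerts, along with all metric names and numbers, are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    metric: str                   # name of the metric checked by the rule
    value: float                  # observed metric value
    threshold: float              # the rule's own, manually set threshold
    peer_thresholds: List[float]  # ASSUMPTION: thresholds of other rules,
                                  # already mapped onto this metric's scale


def peer_score(alert: Alert) -> float:
    """Fraction of peer thresholds that the observed value also violates."""
    if not alert.peer_thresholds:
        return 0.0
    violated = sum(1 for t in alert.peer_thresholds if alert.value > t)
    return violated / len(alert.peer_thresholds)


def rank_alerts(alerts: List[Alert]) -> List[Alert]:
    """Keep only rules that fired, then order them by peer agreement."""
    fired = [a for a in alerts if a.value > a.threshold]
    return sorted(fired, key=peer_score, reverse=True)


# Made-up example data: two rules fire, one does not.
alerts = [
    Alert("cpu_util", 0.93, 0.90, [0.85, 0.88, 0.95, 0.80]),
    Alert("disk_io", 410.0, 400.0, [450.0, 500.0, 380.0, 470.0]),
    Alert("mem_used", 0.60, 0.75, [0.70, 0.65]),
]
ranked = rank_alerts(alerts)
for a in ranked:
    print(a.metric, peer_score(a))
```

In this toy data, the cpu_util value exceeds three of its four peer thresholds while the disk_io value exceeds only one of four, so cpu_util is ranked first. An alert that only its own rule supports drops toward the bottom, which matches the abstract's claim that top-ranked alerts are more likely to be true positives.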