HOLMES: an event-driven solution to monitor data centers through continuous queries and machine learning

Authors:
Pedro Henriques dos Santos Teixeira;Ricardo Gomes Clemente;Ronald Andreu Kaiser;Denis Almeida Vieira, Jr.
Affiliations:
INTELIE Research Lab, Rio de Janeiro, RJ, Brazil;INTELIE Research Lab, Rio de Janeiro, RJ, Brazil;INTELIE Research Lab, Rio de Janeiro, RJ, Brazil;Rede Record, São Paulo, SP, Brazil
Venue:
Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems
Year:
2010


Abstract

Supervisory processes are fundamental to data center operations that strive for fault resilience: any downtime can directly affect the business's income and will certainly affect its reputation. Current monitoring tools rely on experts to configure constant thresholds on single streams, which is not appropriate for dynamic systems and is insufficient to capture complex patterns. We present HOLMES, a system built to help data center experts anticipate failures by combining Event-Driven Architecture, Complex Event Processing (CEP), and an unsupervised machine learning algorithm. Based on rules created by the users, the system continuously checks for known problems. For unknown problems, we leverage the CEP engine to aggregate and join streams of real-time data and feed normalized input to FRAHST, our machine learning algorithm that detects anomalous patterns across multivariate numerical streams. We describe how the UI module also operates within the publish/subscribe paradigm to enhance situational awareness. The system was very well received and was successfully implemented at one of the largest Internet Service Providers in South America.
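
As a rough illustration of the "aggregate, join, normalize" step described in the abstract, the sketch below averages raw events per fixed time window, joins the streams on the window index, and z-scores each stream so that a multivariate detector receives comparable inputs. It is not the HOLMES implementation and it does not reproduce FRAHST. The function names, the event tuple layout, the stream names, and the batch z-score normalization are all assumptions made for this example; a real CEP engine would express the windowing and join as continuous queries and would normalize incrementally.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate(events, window):
    """Average each stream's values per fixed time window.

    events: iterable of (timestamp_seconds, stream_name, value) tuples
            (a hypothetical layout chosen for this sketch).
    Returns {window_index: {stream_name: mean_value}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, stream, value in events:
        buckets[int(ts // window)][stream].append(value)
    return {
        idx: {name: mean(vals) for name, vals in per_stream.items()}
        for idx, per_stream in buckets.items()
    }


def join_and_normalize(aggregated, streams):
    """Join streams on window index, then z-score each stream.

    Only windows in which every requested stream reported a value are
    kept (an inner join). Returns (window_indices, vectors), where each
    vector holds one normalized value per stream, in the given order.
    """
    windows = sorted(
        idx for idx, row in aggregated.items()
        if all(name in row for name in streams)
    )
    columns = {}
    for name in streams:
        col = [aggregated[idx][name] for idx in windows]
        mu, sigma = mean(col), pstdev(col)
        # A constant stream carries no signal; map it to zeros.
        columns[name] = [(x - mu) / sigma if sigma else 0.0 for x in col]
    vectors = [[columns[name][i] for name in streams]
               for i in range(len(windows))]
    return windows, vectors


# Hypothetical CPU and memory readings as (seconds, stream, value).
events = [
    (0, "cpu", 10), (1, "cpu", 20), (0, "mem", 5),
    (10, "cpu", 30), (11, "mem", 7),
    (20, "cpu", 50),  # no "mem" reading in this window, so it is dropped
]
windows, vectors = join_and_normalize(aggregate(events, window=10),
                                      ["cpu", "mem"])
```

Each resulting vector is one multivariate observation per window, which is the kind of input an unsupervised detector over multivariate numerical streams would consume. The inner join is one possible design choice; carrying the last known value forward for a silent stream would be another.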