HOLMES: an event-driven solution to monitor data centers through continuous queries and machine learning

  • Authors:
  • Pedro Henriques dos Santos Teixeira; Ricardo Gomes Clemente; Ronald Andreu Kaiser; Denis Almeida Vieira, Jr.

  • Affiliations:
  • INTELIE Research Lab, Rio de Janeiro, RJ, Brazil; INTELIE Research Lab, Rio de Janeiro, RJ, Brazil; INTELIE Research Lab, Rio de Janeiro, RJ, Brazil; Rede Record, São Paulo, SP, Brazil

  • Venue:
  • Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems
  • Year:
  • 2010

Abstract

Supervisory processes are fundamental to data center operations that strive for fault resilience: any downtime can directly affect the business's income and, certainly, its reputation. Current monitoring tools rely on experts to configure constant thresholds on single streams, an approach that is not appropriate for dynamic systems and is insufficient to capture complex patterns. We present HOLMES, built to help data center experts anticipate failures with a solution that combines Event-Driven Architecture, Complex Event Processing (CEP), and an unsupervised machine learning algorithm. Based on rules created by users, the system continuously checks for known problems. For unknown problems, we leverage the CEP engine to aggregate and join streams of real-time data and feed normalized input to FRAHST, our machine learning algorithm, which detects anomalous patterns across multivariate numerical streams. We describe how the UI module also operates within the publish/subscribe paradigm to enhance situational awareness. The system was very well received and was successfully implemented at one of the largest Internet Service Providers in South America.
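
The two detection paths described in the abstract can be illustrated with a minimal sketch. It is not the HOLMES implementation: the abstract does not specify FRAHST or the CEP engine's query language, so a plain z-score check stands in for the anomaly detector and a small sliding-window class stands in for the CEP aggregation. The stream names (`cpu`, `latency`), the function names, and all thresholds are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindow:
    """Keeps the last `size` readings of one stream and reports their mean."""

    def __init__(self, size):
        self.values = deque(maxlen=size)

    def push(self, value):
        self.values.append(value)
        return mean(self.values)


def normalize(history, value):
    """Z-score of `value` against `history`; 0.0 until a spread can be measured."""
    if len(history) < 2:
        return 0.0
    spread = pstdev(history)
    return 0.0 if spread == 0 else (value - mean(history)) / spread


def detect(events, window_size=3, z_limit=3.0, rule_limit=90.0):
    """Return (index, kind) alerts for events shaped like {"cpu": x, "latency": y}."""
    windows = {"cpu": SlidingWindow(window_size), "latency": SlidingWindow(window_size)}
    history = {"cpu": [], "latency": []}
    alerts = []
    for i, event in enumerate(events):
        # Known problem: a user-written rule on a single stream (assumed threshold).
        if event["cpu"] > rule_limit:
            alerts.append((i, "rule"))
        # Unknown problem: aggregate each joined stream, normalize, then score.
        # The z-score test is a stand-in for FRAHST, not the paper's algorithm.
        scores = []
        for name, window in windows.items():
            avg = window.push(event[name])
            scores.append(abs(normalize(history[name], avg)))
            history[name].append(avg)
        if max(scores) > z_limit:
            alerts.append((i, "anomaly"))
    return alerts
```

Calling `detect` on a list of joined events returns rule alerts for readings that cross the fixed threshold and anomaly alerts for windowed averages that deviate sharply from their own history, which mirrors the split between known and unknown problems.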