System monitoring with metric-correlation models: problems and solutions

  • Authors:
  • Miao Jiang;Mohammad A. Munawar;Thomas Reidemeister;Paul A.S. Ward

  • Affiliations:
  • University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada

  • Venue:
  • ICAC '09 Proceedings of the 6th international conference on Autonomic computing
  • Year:
  • 2009

Abstract

Correlations among management metrics in software systems allow errors to be detected and their cause localized. Prior research shows that linear models can capture many of these correlations. However, our research shows that several factors may prevent linear models from accurately describing correlations, even if the underlying relationship is linear. Two common phenomena we have observed are relationships that evolve, typically with time, and heterogeneous variance of the correlated metrics. Two-variable linear models proposed thus far fail to capture these phenomena, and thus fail to describe system dynamics correctly. Often, these phenomena are caused by a missing variable. However, searching for three-variable correlations is O(n^3) for n metrics, which is costly for systems with many metrics. In this paper we address the above challenges by improving on two-variable Ordinary Least Squares regression models. We validate our models using a realistic Java-Enterprise-Edition application. Using fault-injection experiments we show that our improved models capture system behavior accurately. We detect errors within 8 sample periods on average from the injection of the fault, which is less than half the time required by the current linear-model approach.
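
To make the baseline concrete, the following is a minimal sketch of a plain two-variable Ordinary Least Squares model used as a correlation check, together with the count of candidate models for pairs versus triples of metrics. It is not the improved model the paper proposes, whose details the abstract does not give. The metric names, the sample values, and the 3-sigma residual threshold are assumptions chosen purely for illustration.

```python
from math import comb, sqrt


def fit_ols(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (a, b, sigma), where sigma is the residual standard
    deviation computed with n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = sqrt(sum(r * r for r in residuals) / (n - 2))
    return a, b, sigma


def violates(model, x_new, y_new, k=3.0):
    """Flag a sample whose residual exceeds k * sigma.

    The k = 3 threshold is an illustrative assumption, not the paper's rule.
    """
    a, b, sigma = model
    return abs(y_new - (a + b * x_new)) > k * sigma


# Hypothetical training samples for two correlated metrics.
requests = [10, 20, 30, 40, 50, 60]
cpu_time = [12.1, 21.9, 32.2, 41.8, 52.1, 61.9]
model = fit_ols(requests, cpu_time)

normal_sample_flagged = violates(model, 70, 72.0)   # follows the learned trend
faulty_sample_flagged = violates(model, 70, 95.0)   # far from the learned trend

# Candidate models to examine for n metrics: pairs grow as O(n^2),
# triples as O(n^3), which is the cost the abstract refers to.
n_metrics = 100
pair_count = comb(n_metrics, 2)
triple_count = comb(n_metrics, 3)
```

A fixed residual threshold such as this one assumes constant variance and a relationship that does not drift, which are exactly the two assumptions the abstract reports as failing in practice.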