Ensembles of Models for Automated Diagnosis of System Performance Problems

  • Authors:
  • Steve Zhang; Ira Cohen; Julie Symons; Armando Fox

  • Affiliations:
  • Stanford University; Hewlett Packard Research Labs; Hewlett Packard Research Labs; Stanford University

  • Venue:
  • DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
  • Year:
  • 2005

Abstract

Violations of service level objectives (SLOs) in Internet services are urgent conditions requiring immediate attention. Previously we explored [1] an approach for identifying which low-level system properties were correlated with high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
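
The ensemble idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the model class, the accuracy measure, or the fusion rule, so everything concrete here is an assumption made for illustration: each model is a per-metric Gaussian naive Bayes classifier, a model is judged by balanced accuracy on the most recent window of samples, a new model is induced when the best existing model falls below a hypothetical `min_accuracy` threshold, and "fusion" is simplified to consulting the single best-scoring model. All function and class names are hypothetical.

```python
import math
from statistics import mean, pvariance

def fit_model(window):
    """Induce a toy model: one Gaussian per metric per class.
    `window` is a list of (metrics_dict, violated) samples and is
    assumed to contain both violation and compliance samples."""
    model = {}
    for label in (False, True):
        rows = [m for m, v in window if v == label]
        model[label] = {
            name: (mean([r[name] for r in rows]),
                   max(pvariance([r[name] for r in rows]), 1e-6))  # variance floor
            for name in rows[0]
        }
    return model

def log_pdf(x, mu, var):
    """Log density of a Gaussian with mean `mu` and variance `var`."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def predict(model, metrics):
    """Predict a violation when the summed log-likelihood ratio is positive."""
    score = sum(log_pdf(x, *model[True][n]) - log_pdf(x, *model[False][n])
                for n, x in metrics.items())
    return score > 0

def attribute(model, metrics):
    """Metrics whose current value is more likely under the violation class."""
    return sorted(n for n, x in metrics.items()
                  if log_pdf(x, *model[True][n]) > log_pdf(x, *model[False][n]))

def balanced_accuracy(model, window):
    """Average of the per-class accuracies on `window`."""
    hits = {False: [], True: []}
    for metrics, violated in window:
        hits[violated].append(1.0 if predict(model, metrics) == violated else 0.0)
    return mean([mean(h) for h in hits.values() if h])

class Ensemble:
    def __init__(self, min_accuracy=0.9):  # threshold value is an assumption
        self.models = []
        self.min_accuracy = min_accuracy

    def best(self, window):
        """Model with the highest balanced accuracy on the recent window."""
        return max(self.models,
                   key=lambda m: balanced_accuracy(m, window), default=None)

    def update(self, window):
        """Add a new model only if no existing one captures current behavior."""
        best = self.best(window)
        if best is None or balanced_accuracy(best, window) < self.min_accuracy:
            self.models.append(fit_model(window))

    def diagnose(self, window, metrics):
        """Attribute a sample using the best model for the recent window."""
        return attribute(self.best(window), metrics)

# Two synthetic workloads: violations track `cpu` in A and `disk` in B.
window_a = [({"cpu": c, "disk": d}, v)
            for v, cs in ((False, (0.2, 0.3, 0.25)), (True, (0.9, 0.8, 0.85)))
            for c, d in zip(cs, (0.5, 0.4, 0.6))]
window_b = [({"cpu": c, "disk": d}, v)
            for v, ds in ((False, (0.2, 0.3, 0.25)), (True, (0.9, 0.8, 0.85)))
            for c, d in zip((0.5, 0.4, 0.6), ds)]

ensemble = Ensemble()
for w in (window_a, window_b, window_a):
    ensemble.update(w)
```

In this synthetic run the model induced on workload A predicts workload B poorly, so a second model is added when B appears; when A returns, the first model is reused rather than relearned. Keeping old models around is what lets such a scheme respond to recurring workloads without paying the induction cost again.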