Ensembles of Models for Automated Diagnosis of System Performance Problems

Authors:
Steve Zhang;Ira Cohen;Julie Symons;Armando Fox
Affiliations:
Stanford University;Hewlett Packard Research Labs;Hewlett Packard Research Labs;Stanford University
Venue:
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Year:
2005

Citing 0
Cited 26

Short term performance forecasting in enterprise systems
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles

Three research challenges at the intersection of machine learning, statistical induction, and systems
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10

SearchGen: a synthetic workload generator for scientific literature digital libraries and search engines
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Why did my pc suddenly slow down?
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques

SPIKE: best practice generation for storage area networks
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques

Anomaly detection and diagnosis in grid environments
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Log summarization and anomaly detection for troubleshooting distributed systems
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing

iManage: policy-driven self-management for enterprise-scale systems
Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware

Isolation points: Creating performance-robust enterprise systems
ACM Transactions on Autonomous and Adaptive Systems (TAAS)

DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis
FAST '09 Proceedings of the 7th conference on File and storage technologies

Self-correlating predictive information tracking for large-scale production systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

VCONF: a reinforcement learning approach to virtual machines auto-configuration
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

Towards a middleware for configuring large-scale storage infrastructures
Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science

Fingerprinting the datacenter: automated classification of performance crises
Proceedings of the 5th European conference on Computer systems

iManage: policy-driven self-management for enterprise-scale systems
MIDDLEWARE2007 Proceedings of the 8th ACM/IFIP/USENIX international conference on Middleware

On the use of computational geometry to detect software faults at runtime
Proceedings of the 7th international conference on Autonomic computing

Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing

CLUEBOX: a performance log analyzer for automated troubleshooting
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

Approximating passage time distributions in queueing models by Bayesian expansion
Performance Evaluation

Diagnosis of software failures using computational geometry
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering

IT incident management by analyzing incident relations
ICSOC'12 Proceedings of the 10th international conference on Service-Oriented Computing

G-RCA: a generic root cause analysis platform for service quality management in large IP networks
IEEE/ACM Transactions on Networking (TON)

A flexible elastic control plane for private clouds
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference

Performance optimization of deployed software-as-a-service applications
Journal of Systems and Software

Workload-aware anomaly detection for Web applications
Journal of Systems and Software

Abstract

Violations of service level objectives (SLOs) in Internet services are urgent conditions requiring immediate attention. Previously we explored [1] an approach for identifying which low-level system properties were correlated with high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
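
The ensemble-maintenance idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the model class, the accuracy measure, or the threshold, so the Gaussian naive Bayes model, the balanced-accuracy score, the 0.9 cutoff, and every identifier below (TinyGaussianModel, Ensemble, update, best_model, the sample windows) are assumptions chosen for illustration. The fusion step used for diagnosis is not shown; the sketch only picks the single best-fitting model for a window.

```python
import math
from statistics import mean, pvariance


class TinyGaussianModel:
    """Hypothetical stand-in for one probabilistic model (two-class Gaussian naive Bayes)."""

    def __init__(self, rows, labels):
        self.stats = {}
        for c in (0, 1):
            subset = [r for r, y in zip(rows, labels) if y == c]
            cols = list(zip(*subset))
            prior = len(subset) / len(rows)
            # Small constant keeps the variance strictly positive.
            params = [(mean(col), pvariance(col) + 1e-6) for col in cols]
            self.stats[c] = (prior, params)

    def prob_violation(self, row):
        """Posterior probability that this sample is an SLO violation (class 1)."""
        logp = {}
        for c, (prior, params) in self.stats.items():
            lp = math.log(prior)
            for x, (mu, var) in zip(row, params):
                lp += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            logp[c] = lp
        top = max(logp.values())
        e = {c: math.exp(v - top) for c, v in logp.items()}
        return e[1] / (e[0] + e[1])


def balanced_accuracy(model, rows, labels):
    """Mean of per-class recall; assumes both classes occur in the window."""
    recalls = []
    for c in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == c]
        hits = sum((model.prob_violation(rows[i]) >= 0.5) == bool(c) for i in idx)
        recalls.append(hits / len(idx))
    return mean(recalls)


class Ensemble:
    def __init__(self, threshold=0.9):  # threshold is an assumed value
        self.models = []
        self.threshold = threshold

    def update(self, rows, labels):
        """Induce and add a model only if no existing model fits the window well."""
        if len(set(labels)) < 2:
            return False  # cannot train or score without both classes
        best = max((balanced_accuracy(m, rows, labels) for m in self.models),
                   default=0.0)
        if best >= self.threshold:
            return False
        self.models.append(TinyGaussianModel(rows, labels))
        return True

    def best_model(self, rows, labels):
        return max(self.models, key=lambda m: balanced_accuracy(m, rows, labels))


# Made-up windows of (metric_0, metric_1) samples; label 1 marks an SLO violation.
window_a = [(0.2, 0.5), (0.3, 0.5), (0.25, 0.6), (0.9, 0.5), (0.8, 0.6), (0.85, 0.5)]
window_b = [(0.5, 0.1), (0.6, 0.2), (0.5, 0.15), (0.5, 0.9), (0.6, 0.8), (0.5, 0.95)]
window_labels = [0, 0, 0, 1, 1, 1]
```

In window_a the violations track the first metric, while in window_b they track the second, so a model induced on one window scores poorly on the other and a second model gets added. Keeping old models rather than retraining one model in place lets the ensemble return to an earlier model if an earlier workload recurs.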