Short term performance forecasting in enterprise systems
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Why did my pc suddenly slow down?
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SPIKE: best practice generation for storage area networks
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Anomaly detection and diagnosis in grid environments
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Log summarization and anomaly detection for troubleshooting distributed systems
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
iManage: policy-driven self-management for enterprise-scale systems
Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
Isolation points: Creating performance-robust enterprise systems
ACM Transactions on Autonomous and Adaptive Systems (TAAS)
DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis
FAST '09 Proceedings of the 7th conference on File and storage technologies
Self-correlating predictive information tracking for large-scale production systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
VCONF: a reinforcement learning approach to virtual machines auto-configuration
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Towards a middleware for configuring large-scale storage infrastructures
Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science
Fingerprinting the datacenter: automated classification of performance crises
Proceedings of the 5th European conference on Computer systems
On the use of computational geometry to detect software faults at runtime
Proceedings of the 7th international conference on Autonomic computing
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
CLUEBOX: a performance log analyzer for automated troubleshooting
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Approximating passage time distributions in queueing models by Bayesian expansion
Performance Evaluation
Diagnosis of software failures using computational geometry
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
IT incident management by analyzing incident relations
ICSOC'12 Proceedings of the 10th international conference on Service-Oriented Computing
G-RCA: a generic root cause analysis platform for service quality management in large IP networks
IEEE/ACM Transactions on Networking (TON)
A flexible elastic control plane for private clouds
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Performance optimization of deployed software-as-a-service applications
Journal of Systems and Software
Workload-aware anomaly detection for Web applications
Journal of Systems and Software
Violations of service level objectives (SLOs) in Internet services are urgent conditions requiring immediate attention. Previously we explored [1] an approach for identifying which low-level system properties were correlated with high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
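The ensemble-maintenance idea can be illustrated with a minimal sketch. The abstract does not specify the model class, the accuracy criterion, or how information from several models is fused, so everything here is an assumption made for illustration: `ThresholdModel` is a hypothetical stand-in for a probabilistic model, `min_accuracy` is an invented cutoff, and attribution is simplified to reporting the metric used by the single best-scoring model on the current window.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# One observation: (low-level metric values, was the SLO violated?)
Sample = Tuple[Sequence[float], bool]


@dataclass
class ThresholdModel:
    """Hypothetical stand-in model: predicts a violation when one metric
    exceeds a threshold. Not the model class used in the paper."""
    metric: int
    threshold: float

    def predict(self, metrics: Sequence[float]) -> bool:
        return metrics[self.metric] > self.threshold


def accuracy(model: ThresholdModel, window: List[Sample]) -> float:
    """Fraction of samples in the window that the model classifies correctly."""
    return sum(model.predict(x) == y for x, y in window) / len(window)


def induce(window: List[Sample]) -> ThresholdModel:
    """Induce a new model from data: try every observed (metric, value)
    pair as a threshold and keep the first one with the highest accuracy."""
    best, best_acc = None, -1.0
    for m in range(len(window[0][0])):
        for x, _ in window:
            candidate = ThresholdModel(m, x[m])
            acc = accuracy(candidate, window)
            if acc > best_acc:
                best, best_acc = candidate, acc
    return best


def update_ensemble(ensemble: List[ThresholdModel], window: List[Sample],
                    min_accuracy: float = 0.9) -> ThresholdModel:
    """Add a model only when no existing one captures the current window
    (min_accuracy is an assumed cutoff), then return the best model."""
    if not ensemble or max(accuracy(m, window) for m in ensemble) < min_accuracy:
        ensemble.append(induce(window))
    return max(ensemble, key=lambda m: accuracy(m, window))


# Workload A: violations track metric 0. Workload B: they track metric 1.
workload_a = [((0.2, 0.5), False), ((0.3, 0.1), False),
              ((0.9, 0.4), True), ((0.8, 0.2), True)]
workload_b = [((0.5, 0.1), False), ((0.6, 0.2), False),
              ((0.5, 0.9), True), ((0.4, 0.8), True)]

ensemble: List[ThresholdModel] = []
attributed = [update_ensemble(ensemble, w).metric
              for w in (workload_a, workload_b, workload_a)]
```

In this sketch the second window triggers induction of a new model because the first model no longer fits, while the return to the first workload reuses the model already in the ensemble instead of inducing another one.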