Fingerprinting the datacenter: automated classification of performance crises

Authors:
Peter Bodik; Moises Goldszmidt; Armando Fox; Dawn B. Woodard; Hans Andersen
Affiliations:
UC Berkeley, Berkeley, CA, USA; Microsoft Research, Mountain View, CA, USA; UC Berkeley, Berkeley, CA, USA; Cornell University, Ithaca, NY, USA; Microsoft, Redmond, WA, USA
Venue:
Proceedings of the 5th European conference on Computer systems
Year:
2010

Abstract

Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
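
The abstract describes a fingerprint only as a statistical selection and summarization of performance metrics, so the following is a minimal sketch of the general idea rather than the authors' method. It assumes, purely for illustration, that each metric is summarized across machines by a few percentiles, that each summary is coded as cold (-1), normal (0), or hot (1) against a historical normal range, and that a new crisis matches a known one when their fingerprints lie within a Euclidean distance threshold. All names, percentile choices, ranges, and the threshold are hypothetical.

```python
import math

# Assumed percentiles used to summarize each metric across machines.
PERCENTILES = (25, 50, 95)


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (integer arithmetic)."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]


def discretize(x, low, high):
    """Code a summary value as cold (-1), normal (0), or hot (1)."""
    if x < low:
        return -1
    if x > high:
        return 1
    return 0


def fingerprint(snapshot, normal_ranges):
    """Build a fingerprint vector from per-machine metric values.

    snapshot: dict mapping metric name -> list of per-machine values.
    normal_ranges: dict mapping metric name -> (low, high), assumed to
    come from historical crisis-free operation.
    """
    fp = []
    for metric in sorted(snapshot):  # fixed order keeps vectors comparable
        low, high = normal_ranges[metric]
        for p in PERCENTILES:
            fp.append(discretize(percentile(snapshot[metric], p), low, high))
    return fp


def identify(new_fp, known_crises, threshold):
    """Return the label of the nearest known crisis, or None if none is
    within the (assumed) distance threshold, meaning 'not seen before'."""
    best_label, best_dist = None, float("inf")
    for label, fp in known_crises.items():
        dist = math.dist(new_fp, fp)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None
```

In use, an operator tool would compute `fingerprint(...)` for the current crisis and pass it to `identify(...)` along with fingerprints of previously diagnosed crises; a `None` result signals a crisis that has not been seen before, so no known remedy applies.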