Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
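The abstract does not spell out how a fingerprint is built or compared, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that each metric is summarized across machines by its median, that each selected metric is discretized against a known "normal" range into -1, 0, or +1, and that a crisis counts as previously seen when its fingerprint lies within a Euclidean distance threshold of a stored one. The metric names, normal ranges, crisis labels, and threshold are all hypothetical.

```python
import math
from statistics import median


def summarize(metric_by_machine):
    """Collapse each metric's per-machine values into one datacenter-wide median."""
    return {name: median(values) for name, values in metric_by_machine.items()}


def fingerprint(summary, normal_ranges, selected):
    """Map each selected metric to -1 (below normal), 0 (normal), or +1 (above normal)."""
    fp = []
    for name in selected:
        low, high = normal_ranges[name]
        value = summary[name]
        fp.append(-1 if value < low else (1 if value > high else 0))
    return tuple(fp)


def distance(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(current, known, threshold):
    """Return the label of the nearest known crisis, or None if none is close enough."""
    best_label, best_dist = None, float("inf")
    for label, fp in known.items():
        d = distance(current, fp)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Hypothetical metric names, normal ranges, and labeled past crises.
selected = ["cpu_util", "queue_len", "db_latency_ms"]
normal_ranges = {
    "cpu_util": (0.2, 0.7),
    "queue_len": (0, 50),
    "db_latency_ms": (1, 20),
}
known = {"db_overload": (0, 1, 1), "cpu_saturation": (1, 1, 0)}

# Per-machine readings taken during the crisis under investigation.
snapshot = {
    "cpu_util": [0.4, 0.5, 0.45],
    "queue_len": [80, 120, 95],
    "db_latency_ms": [35, 60, 42],
}

current = fingerprint(summarize(snapshot), normal_ranges, selected)
match = identify(current, known, threshold=1.0)
```

In this sketch, a return value of `None` from `identify` stands for a crisis that has not been seen before, in which case no known solution can be reused and the new fingerprint would be stored once operators have diagnosed it.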