We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristics of a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures, which simply records the actual "raw" values of collected measurements, is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) and a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24x7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
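To make the idea of similarity-based retrieval over indexable signatures concrete, here is a minimal sketch in Python. It is not the method described in the abstract: the signature encoding (one entry per operational metric, with +1, -1, or 0 as a hypothetical flag), the use of cosine similarity, and the names `cosine_similarity`, `retrieve_similar`, and the `incident-*` labels are all assumptions introduced for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero signature carries no information; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query, history, k=2):
    """Return the k past states whose signatures are most similar to `query`.

    `history` is a list of (label, signature) pairs. The labels are
    hypothetical incident names, not annotations from any real trace.
    """
    scored = [(cosine_similarity(query, sig), label) for label, sig in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical signatures: one entry per operational metric (assumed encoding).
history = [
    ("incident-A", [1, 1, 0, -1]),
    ("incident-B", [0, -1, 1, 0]),
    ("incident-C", [1, 1, 0, 0]),
]

# Signature of a newly observed system state.
query = [1, 1, 0, -1]
top = retrieve_similar(query, history, k=2)
for score, label in top:
    print(f"{label}: {score:.3f}")
```

In this sketch, a newly observed state is matched against stored signatures so that a past incident with the same pattern ranks first. The same similarity function could also feed a clustering step to group recurring problems.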