Capturing, indexing, clustering, and retrieving system history

Authors:
Ira Cohen;Steve Zhang;Moises Goldszmidt;Julie Symons;Terence Kelly;Armando Fox
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, CA;Stanford University, Palo Alto, CA;Hewlett-Packard Laboratories, Palo Alto, CA;Hewlett-Packard Laboratories, Palo Alto, CA;Hewlett-Packard Laboratories, Palo Alto, CA;Hewlett-Packard Laboratories, Palo Alto, CA
Venue:
Proceedings of the twentieth ACM symposium on Operating systems principles
Year:
2005

Abstract

We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24 x 7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
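
To make the idea of similarity-based retrieval concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes each system state has already been reduced to a fixed-length numeric signature (how that is done is the subject of the paper and is not shown here), and it ranks past signatures by cosine similarity, which is a stand-in choice. The names `cosine_similarity`, `retrieve_similar`, and the incident labels and vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero signature carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query, history, top_k=3):
    """Return the top_k (similarity, label) pairs from history, best first."""
    scored = [(cosine_similarity(query, sig), label) for label, sig in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical past system states, each labeled and reduced to a signature.
history = [
    ("incident-A", [1, 1, 0, -1]),
    ("incident-B", [0, -1, 1, 1]),
    ("incident-C", [1, 1, 0, 0]),
]

# Signature of a newly observed state (also hypothetical).
query = [1, 1, 0, -1]

results = retrieve_similar(query, history, top_k=2)
for similarity, label in results:
    print(f"{label}: {similarity:.3f}")
```

An operator could use the ranked labels to look up what was done the last time a similar state was seen. Any distance measure or clustering algorithm could replace cosine similarity here; the sketch only illustrates the retrieval step.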