Correlating instrumentation data to system states: a building block for automated diagnosis and control

Authors:
Ira Cohen;Moises Goldszmidt;Terence Kelly;Julie Symons;Jeffrey S. Chase
Affiliations:
HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;Department of Computer Science, Duke University, Durham, NC
Venue:
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Year:
2004

Citing 24
Cited 122

Sun performance and tuning: SPARC & Solaris

Sun performance and tuning: SPARC & Solaris
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

Machine Learning
Bayesian Network Classifiers

Machine Learning - Special issue on learning with probabilistic representations
Adaptive Probabilistic Networks with Hidden Variables

Machine Learning - Special issue on learning with probabilistic representations
httperf—a tool for measuring web server performance

ACM SIGMETRICS Performance Evaluation Review
Data mining: practical machine learning tools and techniques with Java implementations

Data mining: practical machine learning tools and techniques with Java implementations
An introduction to support Vector Machines: and other kernel-based learning methods

An introduction to support Vector Machines: and other kernel-based learning methods
Minerva: An automated resource provisioning tool for large-scale storage systems

ACM Transactions on Computer Systems (TOCS)
Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach

IEEE Transactions on Parallel and Distributed Systems
Neural Networks for Pattern Recognition

Neural Networks for Pattern Recognition
Web transaction analysis and optimization (TAO)

WOSP '02 Proceedings of the 3rd international workshop on Software and performance
The Vision of Autonomic Computing

Computer
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining

ACM Transactions on Computer Systems (TOCS)
A knowledge plane for the internet

Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Grid Information Services for Distributed Resource Sharing

HPDC '01 Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
File Classification in Self-* Storage Systems

ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Magpie: online modelling and performance-aware systems

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Model-based resource provisioning in a web service utility

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Querying the internet with PIER

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Automating computer bottleneck detection with belief nets

UAI'95 Proceedings of the Eleventh conference on Uncertainty in artificial intelligence
Sequential update of Bayesian network structure

UAI'97 Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence

Short term performance forecasting in enterprise systems

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
Effective web service load balancing through statistical monitoring

Communications of the ACM - Self managed systems
Controllable fair queuing for meeting performance goals

Performance Evaluation - Performance 2005
A supervised learning approach for routing optimizations in wireless sensor networks

REALMAN '06 Proceedings of the 2nd international workshop on Multi-hop ad hoc networks: from theory to reality
Challenges in managing dependable data systems

ACM SIGMETRICS Performance Evaluation Review - Design, implementation, and performance of storage systems
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Understanding the management of client perceived response time

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Proactive identification of performance problems

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Mining web logs to debug distant connectivity problems

Proceedings of the 2006 SIGCOMM workshop on Mining network data
Concurrency control in computer services using adaptive optimal control

MIC'06 Proceedings of the 25th IASTED international conference on Modeling, identification, and control
Problem diagnosis in large-scale computing environments

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Emergent (mis)behavior vs. complex software systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Automated known problem diagnosis with event traces

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Analytic modeling of multitier Internet applications

ACM Transactions on the Web (TWEB)
Performance problem localization in self-healing, service-oriented systems using Bayesian networks

Proceedings of the 2007 ACM symposium on Applied computing
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Zodiac: efficient impact analysis for storage area networks

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Three research challenges at the intersection of machine learning, statistical induction, and systems

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Falling off the cliff: when systems go nonlinear

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Detecting performance anomalies in global applications

WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
JIT instrumentation: a novel approach to dynamically instrument operating systems

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Exploiting nonstationarity for performance prediction

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Event summarization for system management

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Performance impacts of autocorrelated flows in multi-tiered systems

Performance Evaluation
Towards an autonomic computing testbed

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Categorizing and differencing system behaviours

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Enabling policy-driven self-management for enterprise-scale systems

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Observer: keeping system models from becoming obsolete

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Mulini: an automated staging framework for QoS of distributed multi-tier applications

Proceedings of the 2007 workshop on Automating service quality: Held at the International Conference on Automated Software Engineering (ASE)
Predicting link quality using supervised learning in wireless sensor networks

ACM SIGMOBILE Mobile Computing and Communications Review
Adaptive quality of service management for enterprise services

ACM Transactions on the Web (TWEB)
Agile dynamic provisioning of multi-tier Internet applications

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Processor hardware counter statistics as a first-class system resource

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Why did my pc suddenly slow down?

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SPIKE: best practice generation for storage area networks

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Monitoring multi-tier clustered systems with invariant metric relationships

Proceedings of the 2008 international workshop on Software engineering for adaptive and self-managing systems
Ironmodel: robust performance models in the wild
Causal analysis for performance modeling of computer programs

Scientific Programming
Cataclysm: Scalable overload policing for internet applications

Journal of Network and Computer Applications
Analysis of application heartbeats: learning structural and temporal features in time series data for identification of performance problems

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
Performance profiling with EndoScope, an acquisitional software monitoring framework

Proceedings of the VLDB Endowment
Profiling services for resource optimization and capacity planning in distributed systems

Cluster Computing
Resource overbooking and application profiling in a shared Internet hosting platform

ACM Transactions on Internet Technology (TOIT)
Utility-driven proactive management of availability in enterprise-scale information flows

Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware
iManage: policy-driven self-management for enterprise-scale systems

Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
Isolation points: Creating performance-robust enterprise systems

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis

FAST '09 Proceedings of the 7th conference on File and storage technologies
Configuration-space performance anomaly depiction

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
System monitoring with metric-correlation models: problems and solutions

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
NAP: a building block for remediating performance bottlenecks via black box network analysis

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
NetPrints: diagnosing home network misconfigurations using shared knowledge

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs

ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I
An intelligent Quality of Service brokering model for e-commerce

Expert Systems with Applications: An International Journal
Performance management via adaptive thresholds with separate control of false positive and false negative errors

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Heteroscedastic models to track relationships between management metrics

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Performance debugging in data centers: doing more with less

COMSNETS'09 Proceedings of the First international conference on COMmunication Systems And NETworks
Reduced dimension control based on online recursive principal component analysis

ACC'09 Proceedings of the 2009 conference on American Control Conference
Do you know your IQ?: a research agenda for information quality in systems

ACM SIGMETRICS Performance Evaluation Review
SelfTalk for Dena: query language and runtime support for evaluating system behavior

ACM SIGOPS Operating Systems Review
Fingerprinting the datacenter: automated classification of performance crises

Proceedings of the 5th European conference on Computer systems
SNTS: sensor network troubleshooting suite

DCOSS'07 Proceedings of the 3rd IEEE international conference on Distributed computing in sensor systems
Towards versatile performance models for complex, popular applications

ACM SIGMETRICS Performance Evaluation Review
CloudXplor: a tool for configuration planning in clouds based on empirical data

Proceedings of the 2010 ACM Symposium on Applied Computing
Bottleneck detection using statistical intervention analysis

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
iManage: policy-driven self-management for enterprise-scale systems

MIDDLEWARE2007 Proceedings of the 8th ACM/IFIP/USENIX international conference on Middleware
A study of dynamic meta-learning for failure prediction in large-scale systems

Journal of Parallel and Distributed Computing
On the use of computational geometry to detect software faults at runtime

Proceedings of the 7th international conference on Autonomic computing
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

Proceedings of the 7th international conference on Autonomic computing
Autonomic policy adaptation using decentralized online clustering

Proceedings of the 7th international conference on Autonomic computing
A methodology to support load test analysis

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2
Practical performance models for complex, popular applications

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Automated debugging of SLO violations in enterprise systems

COMSNETS'10 Proceedings of the 2nd international conference on COMmunication systems and NETworks
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Using virtualization for high availability and disaster recovery

IBM Journal of Research and Development
Automated experiment-driven management of (database) systems

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Lightweight, high-resolution monitoring for troubleshooting production systems

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
CLUEBOX: a performance log analyzer for automated troubleshooting

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Empirical comparison of techniques for automated failure diagnosis

SysML'08 Proceedings of the Third conference on Tackling computer systems problems with machine learning techniques
Detecting user-visible failures in AJAX web applications by analyzing users' interaction behaviors

Proceedings of the IEEE/ACM international conference on Automated software engineering
Diagnosing mobile applications in the wild

Hotnets-IX Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks
Application classification through monitoring and learning of resource consumption patterns

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
NEVERMIND, the problem is already fixed: proactively detecting and troubleshooting customer DSL problems

Proceedings of the 6th International COnference
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Analyzing web logs to detect user-visible failures

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Blink: managing server clusters on intermittent power

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Leveraging many simple statistical models to adaptively monitor software systems

International Journal of High Performance Computing and Networking
A root cause localization model for large scale systems

HotDep'05 Proceedings of the First conference on Hot topics in system dependability
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Towards 'integrated' monitoring and management of DataCenters using complex event processing techniques

COMPUTE '11 Proceedings of the Fourth Annual ACM Bangalore Conference
More intervention now!

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
OLIC: online information compression for scalable hosting infrastructure monitoring

Proceedings of the Nineteenth International Workshop on Quality of Service
Automated control for elastic n-tier workloads based on empirical modeling

Proceedings of the 8th ACM international conference on Autonomic computing
Analyzing IPTV set-top box crashes

Proceedings of the 2nd ACM SIGCOMM workshop on Home networks
Large-scale app-based reporting of customer problems in cellular networks: potential and limitations

Proceedings of the first ACM SIGCOMM workshop on Measurements up the stack
PAL: Propagation-aware Anomaly Localization for cloud hosted distributed applications

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Q-score: proactive service quality assessment in a large IPTV system

Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference
Session management of correlated multi-stream 3D tele-immersive environments

MM '11 Proceedings of the 19th ACM international conference on Multimedia
Detecting bottleneck in n-tier IT applications through analysis

DSOM'06 Proceedings of the 17th IFIP/IEEE international conference on Distributed Systems: operations and management
Utility-driven proactive management of availability in enterprise-scale information flows

Middleware'06 Proceedings of the 7th ACM/IFIP/USENIX international conference on Middleware
Modeling virtualized applications using machine learning techniques

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Modellus: Automated modeling of complex internet data center applications

ACM Transactions on the Web (TWEB)
Automated detection of performance regressions using statistical process control techniques

ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Diagnosis of software failures using computational geometry

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
An autonomic framework for enhancing the quality of data grid services

Future Generation Computer Systems
DAPA: diagnosing application performance anomalies for virtualized infrastructures

Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Structured comparative analysis of systems logs to diagnose performance problems

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Understanding and detecting real-world performance bugs

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

Proceedings of the 9th international conference on Autonomic computing
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Leveraging many simple statistical models to adaptively monitor software systems

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
G-RCA: a generic root cause analysis platform for service quality management in large IP networks

IEEE/ACM Transactions on Networking (TON)
A framework to compute statistics of system parameters from very large trace files

ACM SIGOPS Operating Systems Review
Fmeter: extracting indexable low-level system signatures by counting kernel function calls

Proceedings of the 13th International Middleware Conference
vPerfGuard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Limplock: understanding the impact of limpware on scale-out cloud systems

Proceedings of the 4th annual Symposium on Cloud Computing
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Towards detecting software performance anti-patterns using classification techniques

ACM SIGSOFT Software Engineering Notes
Workload-aware anomaly detection for Web applications

Journal of Systems and Software
NetCheck: network diagnoses from blackbox traces

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

This paper studies the use of statistical induction techniques as a basis for automated performance diagnosis and performance management. The goal of the work is to develop and evaluate tools for offline and online analysis of system metrics gathered from instrumentation in Internet server platforms. We use a promising class of probabilistic models (Tree-Augmented Bayesian Networks or TANs) to identify combinations of system-level metrics and threshold values that correlate with high-level performance states (compliance with Service Level Objectives, or SLOs, for average-case response time) in a three-tier Web service under a variety of conditions. Experimental results from a testbed show that TAN models involving small subsets of metrics capture patterns of performance behavior in a way that is accurate and yields insights into the causes of observed performance effects. TANs are extremely efficient to represent and evaluate, and they have interpretability properties that make them excellent candidates for automated diagnosis and control. We explore the use of TAN models for offline forensic diagnosis, and in a limited online setting for performance forecasting with stable workloads.
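
The following is a minimal sketch, not the authors' implementation, of the general idea the abstract describes: label each measurement interval by SLO compliance, discretize metrics at a threshold, and learn the tree over metrics that a TAN adds on top of naive Bayes. It follows the standard TAN construction (a maximum spanning tree weighted by conditional mutual information given the class); the paper's exact procedure may differ. All function names (`slo_labels`, `discretize`, `cond_mutual_info`, `tan_tree`), the metric names, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def slo_labels(avg_response_times, slo_threshold):
    """Label an interval 1 if its average response time meets the SLO, else 0."""
    return [1 if t <= slo_threshold else 0 for t in avg_response_times]


def discretize(values, cut):
    """Binarize a raw metric at a single (hypothetical) threshold value."""
    return [1 if v > cut else 0 for v in values]


def cond_mutual_info(x, y, c):
    """Empirical conditional mutual information I(X; Y | C) in nats."""
    n = len(c)
    n_xyc = Counter(zip(x, y, c))
    n_xc = Counter(zip(x, c))
    n_yc = Counter(zip(y, c))
    n_c = Counter(c)
    total = 0.0
    for (xv, yv, cv), k in n_xyc.items():
        # p(x,y,c) * log[ p(x,y|c) / (p(x|c) * p(y|c)) ], written with counts
        total += (k / n) * math.log(k * n_c[cv] / (n_xc[(xv, cv)] * n_yc[(yv, cv)]))
    return total


def tan_tree(metrics, labels):
    """Return (parent, child) edges of a maximum spanning tree over metrics.

    metrics maps a metric name to its discretized samples; edge weights are
    I(metric_a; metric_b | SLO state). Prim's algorithm, rooted at the first
    name in sorted order (an arbitrary choice).
    """
    names = sorted(metrics)
    weight = {
        frozenset(pair): cond_mutual_info(metrics[pair[0]], metrics[pair[1]], labels)
        for pair in combinations(names, 2)
    }
    in_tree = [names[0]]
    edges = []
    while len(in_tree) < len(names):
        parent, child = max(
            ((u, v) for u in in_tree for v in names if v not in in_tree),
            key=lambda e: weight[frozenset(e)],
        )
        edges.append((parent, child))
        in_tree.append(child)
    return edges


if __name__ == "__main__":
    # Toy data invented for this sketch: eight intervals, SLO of 1.0 second.
    labels = slo_labels([0.2, 0.3, 0.4, 0.5, 1.5, 1.8, 2.0, 2.5], 1.0)
    metrics = {
        "cpu": discretize([10, 90, 20, 80, 15, 95, 25, 85], 50),
        "mem": discretize([30, 70, 40, 60, 35, 75, 45, 65], 50),
        "net": discretize([1, 2, 8, 9, 1, 3, 7, 9], 5),
    }
    print(tan_tree(metrics, labels))
```

In a full TAN classifier, each metric in the learned tree would additionally have the SLO state as a parent, and conditional probability tables would be estimated from the same counts; that step is omitted here to keep the sketch short.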