CLUEBOX: a performance log analyzer for automated troubleshooting

Authors:
S. Ratna Sandeep; M. Swapna; Thirumale Niranjan; Sai Susarla; Siddhartha Nandi
Affiliations:
NetApp, Inc.; NetApp, Inc.; NetApp, Inc.; NetApp, Inc.; NetApp, Inc.
Venue:
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Year:
2008

Abstract

Performance problems in complex systems are often caused by under-provisioning, workload interference, incorrect expectations, or bugs. Troubleshooting such systems is a difficult task for service engineers. We have built CLUEBOX, a non-intrusive toolkit that aids rapid problem diagnosis. It applies machine learning techniques to the available performance logs to characterize workloads, predict performance, and discover anomalous behavior. By identifying the most relevant anomalies to focus on, CLUEBOX automates the most onerous aspects of performance troubleshooting. We have experimentally validated our methodology in a networked storage environment with real workloads. Using CLUEBOX to learn from a set of historical performance observations, we were able to distill over 2000 performance counters into 68 counters that succinctly describe a running workload. Further, we demonstrate effective troubleshooting of two scenarios that adversely impacted application response time: (1) an unknown competing workload, and (2) a file system consistency checker. By reducing the set of anomalous counters to examine to a dozen significant ones, CLUEBOX was able to guide a systems engineer rapidly toward the correct root cause.
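The idea of narrowing many counters down to a dozen worth examining can be illustrated with a minimal sketch. The abstract does not say how CLUEBOX scores anomalies, so the z-score ranking below is a stand-in technique chosen for simplicity, not the authors' method. The function name, counter names, and values are all hypothetical.

```python
from statistics import mean, pstdev


def rank_anomalous_counters(baseline, observed, top_k=12):
    """Rank counters by how far the observed value deviates from history.

    baseline: dict mapping counter name -> list of historical values
    observed: dict mapping counter name -> current value
    Returns up to top_k (name, z_score) pairs, largest |z_score| first.
    NOTE: a plain z-score is an assumption made for illustration only.
    """
    scores = []
    for name, history in baseline.items():
        if name not in observed or len(history) < 2:
            continue  # not enough data to judge this counter
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Constant counter: any change is notable, no change is not.
            z = 0.0 if observed[name] == mu else float("inf")
        else:
            z = (observed[name] - mu) / sigma
        scores.append((name, z))
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:top_k]


# Hypothetical counter names and values, for illustration only.
baseline = {
    "nfs_read_latency_ms": [2.0, 2.2, 1.9, 2.1, 2.0],
    "disk_busy_pct": [35.0, 40.0, 38.0, 37.0, 36.0],
    "cpu_busy_pct": [50.0, 52.0, 49.0, 51.0, 50.0],
}
observed = {
    "nfs_read_latency_ms": 9.5,
    "disk_busy_pct": 95.0,
    "cpu_busy_pct": 51.0,
}

for name, z in rank_anomalous_counters(baseline, observed, top_k=2):
    print(f"{name}: z = {z:.1f}")
```

In this toy data the latency and disk counters rank ahead of the CPU counter, which stays near its historical mean. A real deployment would presumably need a workload-aware baseline, which this sketch does not attempt.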