CLUEBOX: a performance log analyzer for automated troubleshooting

  • Authors:
  • S. Ratna Sandeep;M. Swapna;Thirumale Niranjan;Sai Susarla;Siddhartha Nandi

  • Affiliations:
  • NetApp, Inc.;NetApp, Inc.;NetApp, Inc.;NetApp, Inc.;NetApp, Inc.

  • Venue:
  • WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
  • Year:
  • 2008

Abstract

Performance problems in complex systems are often caused by under-provisioning, workload interference, incorrect expectations, or bugs. Troubleshooting such systems is a difficult task faced by service engineers. We have built CLUEBOX, a non-intrusive toolkit that aids rapid problem diagnosis. It applies machine learning techniques to the available performance logs to characterize workloads, predict performance, and discover anomalous behavior. By identifying the most relevant anomalies to focus on, CLUEBOX automates the most onerous aspects of performance troubleshooting. We have experimentally validated our methodology in a networked storage environment with real workloads. Using CLUEBOX to learn from a set of historical performance observations, we were able to distill over 2000 performance counters into 68 counters that succinctly describe a running workload. Further, we demonstrate effective troubleshooting of two scenarios that adversely impacted application response time: (1) an unknown competing workload, and (2) a file system consistency checker. By narrowing the anomalous counters to a dozen significant ones, CLUEBOX was able to guide a systems engineer rapidly towards the correct root cause.
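
The idea of narrowing many counters down to a few anomalous ones can be illustrated with a minimal sketch. This is not the CLUEBOX algorithm, which the abstract does not spell out; it assumes a plain z-score against a historical baseline as a stand-in for the learned model. The counter names, the sample values, and the function `rank_anomalous_counters` are all hypothetical.

```python
from statistics import mean, stdev


def rank_anomalous_counters(history, current, top_k=12):
    """Rank counters by how far the current value deviates from its baseline.

    history: dict mapping counter name -> list of past observations
    current: dict mapping counter name -> latest observation
    Returns up to top_k (name, score) pairs, most anomalous first.
    NOTE: z-scoring is an assumption for illustration, not the paper's method.
    """
    scores = {}
    for name, samples in history.items():
        # Skip counters with no current reading or too little history.
        if name not in current or len(samples) < 2:
            continue
        mu = mean(samples)
        sigma = stdev(samples)
        if sigma == 0:
            # A counter that never varied: any change is maximally surprising.
            score = 0.0 if current[name] == mu else float("inf")
        else:
            score = abs(current[name] - mu) / sigma
        scores[name] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical counters and values, invented for this example.
history = {
    "disk_busy_pct": [20.0, 22.0, 21.0, 19.0, 23.0],
    "nfs_ops_per_s": [1000.0, 1020.0, 990.0, 1010.0, 980.0],
    "cpu_busy_pct": [35.0, 36.0, 34.0, 35.0, 37.0],
}
current = {"disk_busy_pct": 85.0, "nfs_ops_per_s": 1005.0, "cpu_busy_pct": 36.0}

top = rank_anomalous_counters(history, current, top_k=2)
for name, score in top:
    print(f"{name}: {score:.2f}")
```

In this toy data the disk counter ranks first because its current value lies far outside its historical spread, which mirrors how a short ranked list could point an engineer at where to look first.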