Detecting performance anomalies in global applications

  • Authors: Terence Kelly
  • Affiliations: Hewlett-Packard Laboratories, Palo Alto, CA
  • Venue: WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
  • Year: 2005

Abstract

Understanding real, large distributed systems can be as difficult and important as building them. Complex modern applications that span geographic and organizational boundaries confound performance analysis in challenging new ways. These systems clearly demand new analytic methods, but we are wary of approaches that suffer from the same problems as the systems themselves (e.g., complexity and opacity). This paper shows how to obtain valuable insight into the performance of globally distributed applications without abstruse techniques or detailed application knowledge: simple queueing-theoretic observations together with standard optimization methods yield remarkably accurate performance models. The models can be used for performance anomaly detection, i.e., distinguishing performance faults from mere overload. This distinction can in turn suggest both performance debugging tools and remedial measures. Extensive empirical results from three production systems serving real customers (two of which are globally distributed and span administrative domains) demonstrate that our method yields accurate performance models of diverse applications. Our method furthermore flagged as anomalous an episode of a real performance bug in one of the three systems.
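
The abstract does not spell out the model, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes that the total response time in each measurement interval is roughly a weighted sum of per-type transaction counts, fits the weights with ordinary least squares (a stand-in for the "standard optimization methods" mentioned above), and flags intervals whose observed time deviates strongly from the workload-based prediction. The function names, the `tolerance` threshold, and all data are hypothetical.

```python
def fit_service_times(counts, totals):
    """Least-squares fit of totals[i] ~ sum_j counts[i][j] * a[j].

    counts: per-interval transaction counts, one column per transaction type.
    totals: observed sum of response times in each interval.
    Assumption: a linear workload model; solved via the normal equations.
    """
    n = len(counts[0])
    # Build the normal equations (X^T X) a = X^T y.
    A = [[sum(row[j] * row[k] for row in counts) for k in range(n)]
         for j in range(n)]
    b = [sum(row[j] * y for row, y in zip(counts, totals)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * a[c] for c in range(r + 1, n))
        a[r] = (b[r] - s) / A[r][r]
    return a


def flag_anomalies(counts, totals, a, tolerance=0.5):
    """Return indices of intervals the workload model fails to explain.

    An interval is flagged when |observed - predicted| exceeds
    tolerance * predicted (a hypothetical threshold, chosen for illustration).
    """
    flagged = []
    for i, (row, y) in enumerate(zip(counts, totals)):
        predicted = sum(c * x for c, x in zip(row, a))
        if abs(y - predicted) > tolerance * predicted:
            flagged.append(i)
    return flagged


# Made-up training data generated from per-type costs of 0.1 and 0.5 seconds.
train_counts = [[10, 2], [5, 8], [20, 1], [3, 3]]
train_totals = [2.0, 4.5, 2.5, 1.8]
weights = fit_service_times(train_counts, train_totals)

# Two new intervals with identical workload; the second is far slower
# than the workload alone would predict.
new_counts = [[8, 4], [8, 4]]
new_totals = [2.8, 9.0]
anomalous = flag_anomalies(new_counts, new_totals, weights)
```

Under this reading, an interval that is slow but well predicted by its workload would count as overload, while a slow interval that the model cannot explain would be a candidate performance fault.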