Detecting performance anomalies in global applications

Authors:
Terence Kelly
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, CA
Venue:
WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
Year:
2005

Citing 14
Cited 19

Characterizing the scalability of a large web-based shopping system

ACM Transactions on Internet Technology (TOIT)
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Eigenspace-based anomaly detection in computer systems

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Aberrant Behavior Detection in Time Series for Network Monitoring

LISA '00 Proceedings of the 14th USENIX conference on System administration
Queueing Networks and Markov Chains

Queueing Networks and Markov Chains
Subgradient and sampling algorithms for l1 regression

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
Crash-only software

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Performance modeling and system management for multi-component online services

NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Detecting application-level failures in component-based Internet services

IEEE Transactions on Neural Networks

A Synthetic Workload Generation Technique for Stress Testing Session-Based Systems

IEEE Transactions on Software Engineering
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Exploiting nonstationarity for performance prediction

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Anomaly detection and diagnosis in grid environments

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Models and framework for supporting runtime decisions in Web-based systems

ACM Transactions on the Web (TWEB)
Translating Service Level Objectives to lower level policies for multi-tier services

Cluster Computing
A regression-based analytic model for capacity planning of multi-tier applications

Cluster Computing
Modeling and exploiting query interactions in database systems

Proceedings of the 17th ACM conference on Information and knowledge management
Log summarization and anomaly detection for troubleshooting distributed systems

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
R-Capriccio: a capacity planning and anomaly detection tool for enterprise services with live workloads

Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
R-Capriccio: a capacity planning and anomaly detection tool for enterprise services with live workloads

MIDDLEWARE2007 Proceedings of the 8th ACM/IFIP/USENIX international conference on Middleware
Performance models oriented to the dynamic resource provisioning in shared data centres

ICACT'10 Proceedings of the 12th international conference on Advanced communication technology
Predicting completion times of batch query workloads using interaction-aware models and simulation

Proceedings of the 14th International Conference on Extending Database Technology
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Interaction-aware scheduling of report-generation workloads

The VLDB Journal — The International Journal on Very Large Data Bases
Separating Performance Anomalies from Workload-Explained Failures in Streaming Servers

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Information Technologies capacity planning in manufacturing systems: Proposition for a modelling process and application in the semiconductor industry

Computers in Industry
Towards building performance models for data-intensive workloads in public clouds

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Indirect estimation of service demands in the presence of structural changes

Performance Evaluation

Abstract

Understanding real, large distributed systems can be as difficult and important as building them. Complex modern applications that span geographic and organizational boundaries confound performance analysis in challenging new ways. These systems clearly demand new analytic methods, but we are wary of approaches that suffer from the same problems as the systems themselves (e.g., complexity and opacity). This paper shows how to obtain valuable insight into the performance of globally distributed applications without abstruse techniques or detailed application knowledge: simple queueing-theoretic observations together with standard optimization methods yield remarkably accurate performance models. The models can be used for performance anomaly detection, i.e., distinguishing performance faults from mere overload. This distinction can in turn suggest both performance debugging tools and remedial measures. Extensive empirical results from three production systems serving real customers, two of which are globally distributed and span administrative domains, demonstrate that our method yields accurate performance models of diverse applications. Furthermore, our method flagged as anomalous an episode of a real performance bug in one of the three systems.
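The distinction between a performance fault and mere overload can be illustrated with a minimal sketch. The abstract does not give the model's form, so everything below is an assumption made for illustration: a linear model that predicts aggregate response time per time window from per-type transaction counts, an ordinary least-squares fit standing in for the "standard optimization methods", a hypothetical 50% tolerance, and made-up sample numbers. A window is flagged only when observed response time departs from what the workload itself explains.

```python
def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coeffs, counts):
    """Model-predicted aggregate response time for one window."""
    return sum(c * x for c, x in zip(coeffs, counts))


def flag_anomalies(coeffs, X, y, tolerance=0.5):
    """Flag windows whose observed time deviates from the prediction
    by more than `tolerance` (a hypothetical threshold) times the prediction."""
    flags = []
    for counts, observed in zip(X, y):
        expected = predict(coeffs, counts)
        flags.append(abs(observed - expected) > tolerance * expected)
    return flags


# Illustrative training data: counts of two transaction types per window,
# and the summed response time (seconds) observed in that window.
train_counts = [[10, 5], [20, 5], [10, 20], [30, 10]]
train_times = [4.5, 6.5, 12.0, 11.0]
coeffs = fit_least_squares(train_counts, train_times)

# Window 1: heavy load, but the slowdown is explained by the workload.
# Window 2: light load with unexpectedly high response time.
test_counts = [[100, 80], [10, 5]]
test_times = [60.0, 15.0]
flags = flag_anomalies(coeffs, test_counts, test_times)
```

In this sketch the first test window is slow only because it is busy, so the model accounts for it; the second is slow despite light load, which is the kind of episode a workload-based model would treat as anomalous.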