PAL: Propagation-aware Anomaly Localization for cloud hosted distributed applications

Authors:
Hiep Nguyen;Yongmin Tan;Xiaohui Gu
Affiliations:
North Carolina State University;North Carolina State University;North Carolina State University
Venue:
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Year:
2011

Abstract

Distributed applications running inside the cloud are prone to performance anomalies caused by insufficient resource allocations, unexpected workload increases, or software bugs. Such applications often consist of multiple interacting components, so an anomaly in one component may cause its dependent components to exhibit anomalous behavior as well. This makes it challenging to identify the faulty components among the many components of a distributed application. In this paper, we present a Propagation-aware Anomaly Localization (PAL) system that pinpoints the source faulty components in distributed applications by extracting anomaly propagation patterns. PAL provides a robust critical change point discovery algorithm that accurately captures the onset of anomaly symptoms at different application components. We then derive the propagation pattern by sorting all critical change points in chronological order. PAL is completely application-agnostic and non-intrusive: it relies only on system-level metrics. We have implemented PAL on top of the Xen platform and tested it on a production cloud computing infrastructure using the RUBiS online auction benchmark application and the IBM System S data stream processing application with a range of common software bugs. Our experimental results show that PAL can pinpoint faulty components in distributed applications with high accuracy and low overhead.
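
The two steps named in the abstract, finding the onset of anomaly symptoms per component and then sorting those onsets chronologically, can be illustrated with a minimal sketch. The abstract does not describe PAL's change point algorithm, so a simple two-sided CUSUM detector is swapped in here as a stand-in; it is not the authors' method. All names (`first_change_point`, `propagation_order`, `step_series`), the parameter values, and the synthetic metric data are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def first_change_point(series, baseline_len=10, threshold=5.0, drift=1.0):
    """Return the estimated onset index of the first change, or None.

    Stand-in detector (two-sided CUSUM on z-scores), NOT PAL's algorithm.
    The first `baseline_len` samples are assumed to be anomaly-free.
    """
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # avoid division by zero on flat baselines
    up = down = 0.0
    up_start = down_start = None
    for i in range(baseline_len, len(series)):
        z = (series[i] - mu) / sigma
        up = max(0.0, up + z - drift)      # accumulates upward shifts
        down = max(0.0, down - z - drift)  # accumulates downward shifts
        # Remember where each running sum last left zero: that is the onset.
        up_start = None if up == 0.0 else (i if up_start is None else up_start)
        down_start = None if down == 0.0 else (i if down_start is None else down_start)
        if up > threshold:
            return up_start
        if down > threshold:
            return down_start
    return None


def propagation_order(metrics):
    """Sort components by change point onset; the earliest is the suspected source."""
    onsets = {}
    for name, series in metrics.items():
        cp = first_change_point(series)
        if cp is not None:
            onsets[name] = cp
    return sorted(onsets.items(), key=lambda kv: kv[1])


def step_series(onset, length=40, low=10.0, high=30.0):
    """Synthetic system-level metric (hypothetical data): small jitter plus a step at `onset`."""
    return [(low if i < onset else high) + (i % 2) for i in range(length)]


# Hypothetical three-tier application: the fault starts at the database
# and propagates to the application server, then to the web server.
metrics = {
    "web": step_series(26),
    "app": step_series(23),
    "db": step_series(20),
    "cache": step_series(40),  # never changes, so it is not in the pattern
}
order = propagation_order(metrics)
print("propagation pattern:", order)
print("suspected source:", order[0][0])
```

Under these assumptions, the component whose change point comes first in the sorted list is reported as the likely source, and components with no detected change are excluded from the propagation pattern. A real deployment would also have to cope with clock skew between hosts and with normal workload fluctuations, which this sketch ignores.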