Detection of abrupt changes: theory and application
Detection of abrupt changes: theory and application
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A brief history of NTP time: memoirs of an Internet timekeeper
ACM SIGCOMM Computer Communication Review
Change-Point Monitoring for the Detection of DoS Attacks
IEEE Transactions on Dependable and Secure Computing
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
E2EProf: Automated End-to-End Performance Management for Enterprise Systems
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
SPADE: the system s declarative stream processing engine
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
What's going on?: learning communication rules in edge networks
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Fa: A System for Automating Failure Diagnosis
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
NAP: a building block for remediating performance bottlenecks via black box network analysis
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Automating network application dependency discovery: experiences, limitations, and new solutions
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
EntomoModel: Understanding and Avoiding Performance Anomaly Manifestations
MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
On Predictability of System Anomalies in Real World
MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
RainMon: an integrated approach to mining bursty timeseries monitoring data
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Hi-index | 0.01 |
Distributed applications running inside cloud are prone to performance anomalies due to various reasons such as insufficient resource allocations, unexpected workload increases, or software bugs. However, those applications often consist of multiple interacting components where one component anomaly may cause its dependent components to exhibit anomalous behavior as well. It is challenging to identify the faulty components among numerous distributed application components. In this paper, we present a Propagation-aware Anomaly Localization (PAL) system that can pinpoint the source faulty components in distributed applications by extracting anomaly propagation patterns. PAL provides a robust critical change point discovery algorithm to accurately capture the onset of anomaly symptoms at different application components. We then derive the propagation pattern by sorting all critical change points in chronological order. PAL is completely application-agnostic and non-intrusive, which only relies on system-level metrics. We have implemented PAL on top of the Xen platform and tested it on a production cloud computing infrastructure using the RUBiS online auction benchmark application and the IBM System S data streaming processing application with a range of common software bugs. Our experimental results show that PAL can pinpoint faulty components in distributed applications with high accuracy and low overhead.