EbAT: online methods for detecting utility cloud anomalies

Authors:
Chengwei Wang
Affiliations:
Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the 6th Middleware Doctoral Symposium
Year:
2009



Abstract

The online detection of anomalies is a vital element of operations in data centers and in utility clouds like Amazon EC2. Given ever-increasing data center sizes coupled with the complexity of systems software, applications, and workload patterns, such anomaly detection must operate automatically at runtime, without requiring prior knowledge of normal or anomalous behavior. Further, detection should work at different levels of abstraction, such as hardware and software, and for the multiple metrics used in cloud computing systems. This paper proposes EbAT (Entropy-based Anomaly Testing), which offers novel methods that detect anomalies by analyzing the distributions of arbitrary metrics rather than thresholds on individual metrics. Entropy is used as a measure of the degree of dispersal or concentration of such distributions, and raw metric data is aggregated across the cloud stack to form entropy time series. For scalability, these time series can then be combined hierarchically and across multiple cloud subsystems. Finally, online tools (time series analysis, signal processing, or subspace methods) are used to identify anomalies in the entropy time series (matrices) in each subsystem or at each level of the hierarchy. One outcome is the ability to 'zoom in' on the components and metrics where anomalies may be originating. Experimental results demonstrate the viability of the approach; future experimentation will focus on scalable operation and on further reliability evaluation and improvement.
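The idea of turning raw metric samples into an entropy time series can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how metrics are binned, how windows are chosen, or which detector is applied, so the fixed-width bins, the non-overlapping windows, the function names (`shannon_entropy`, `entropy_series`, `flag_jumps`), and the simple jump threshold are all assumptions made for illustration. The jump check in particular is only a crude stand-in for the time series analysis, signal processing, or subspace methods the abstract mentions.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of observations."""
    counts = Counter(observations)
    total = len(observations)
    # p * log2(1/p) summed over distinct values; 0.0 when all values coincide.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_series(samples, window, bin_width):
    """Entropy of each non-overlapping window of samples.

    Assumption: samples are discretized into fixed-width bins before the
    distribution is formed. The abstract does not state a binning scheme.
    """
    series = []
    for start in range(0, len(samples) - window + 1, window):
        bins = [int(v // bin_width) for v in samples[start:start + window]]
        series.append(shannon_entropy(bins))
    return series


def flag_jumps(series, threshold):
    """Indices where entropy changes by more than threshold from the prior window.

    Assumption: a hypothetical placeholder detector, not one of the paper's tools.
    """
    return [i for i in range(1, len(series))
            if abs(series[i] - series[i - 1]) > threshold]


# Hypothetical CPU utilization samples (percent): two steady windows,
# then one window whose values are widely dispersed.
cpu = [10, 11, 12, 13, 10, 11, 12, 13, 5, 50, 95, 70]
series = entropy_series(cpu, window=4, bin_width=10)
flagged = flag_jumps(series, threshold=1.0)
```

In this toy input the first two windows fall into a single bin, so their entropy is zero, while the last window spreads across four bins and its entropy rises to two bits. The change in the distribution, rather than any single value crossing a threshold, is what gets flagged.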