Information and control in gray-box systems
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Diagnosing network-wide traffic anomalies
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Mining anomalies using traffic feature distributions
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring
ICDCS '06 Proceedings of the 26th IEEE International Conference on Distributed Computing Systems
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Why do internet services fail, and what can be done about it?
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
E2EProf: Automated End-to-End Performance Management for Enterprise Systems
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Towards highly reliable enterprise network services via inference of multi-level dependencies
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
iManage: policy-driven self-management for enterprise-scale systems
Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
ACM Computing Surveys (CSUR)
Faster, larger, easier: reining real-time big data processing in cloud
Proceedings of the Posters and Demo Track
VScope: middleware for troubleshooting time-sensitive data center applications
Proceedings of the 13th International Middleware Conference
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
The online detection of anomalies is a vital element of operations in datacenters and in utility clouds like Amazon EC2. Given ever-increasing data center sizes coupled with the complexities of systems software, applications, and workload patterns, such anomaly detection must operate automatically at runtime and without the need for knowledge about normal or anomalous behaviors. Further, detection should function for different levels of abstraction like hardware and software, and for the multiple metrics used in cloud computing systems. This paper proposes EbAT -- Entropy-based Anomaly Testing -- offering novel methods that detect anomalies by analyzing for arbitrary metrics their distributions rather than individual metric thresholds. Entropy is used as a measurement that captures the degree of dispersal or concentration of such distributions, aggregating raw metric data across the cloud stack to form entropy time series. For scalability, such time series can then be combined hierarchically and across multiple cloud subsystems. Finally, online tools -- time series analysis, signal processing or subspace method -- are used to identify anomalies in entropy time series (matrices) in each subsystem or at each level of hierarchy. One outcome is our ability to 'zoom in' to the components and metrics where anomalies may be originating. Experimental results demonstrate the viability of the approach, with future experimentation focusing on scalable operation as well as on further reliability evaluation and improvement.