Faults in complex systems are becoming extremely difficult to detect and isolate as these systems grow in complexity. While a large amount of monitoring data can be collected from such systems for fault analysis, one challenge is how to correlate the data effectively across distributed systems and observation time. As user requests flow through a distributed system, much of the internal monitoring data reacts to the volume of those requests. In this paper, we use Gaussian mixture models to characterize the probabilistic correlation between flow intensities measured at multiple points. We propose a novel algorithm, derived from the Expectation-Maximization (EM) algorithm, that learns the "likely" boundary of the normal data relationship; this boundary is then used as an oracle in anomaly detection. Our recursive algorithm adaptively estimates the boundary of dynamic data relationships and detects faults in real time. We test our approach on a real system with injected faults, and the results demonstrate its feasibility.
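The general idea of modeling two flow intensities with a Gaussian mixture and flagging points outside a "likely" boundary can be illustrated with a small sketch. This is not the recursive algorithm proposed in the paper: it uses ordinary batch EM, and the boundary is simply assumed to be the 1st percentile of the training log-density. The synthetic data, the two load regimes, the 0.5 ratio between the measurement points, and all function names (`fit_gmm`, `is_anomalous`, and so on) are hypothetical choices made for illustration only.

```python
import math
import random

def gauss_logpdf(pt, mean, cov):
    """Log density of a 2-D Gaussian; cov is (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = pt[0] - mean[0], pt[1] - mean[1]
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - math.log(2.0 * math.pi) - 0.5 * math.log(det)

def component_terms(pt, model):
    """log(weight) + log density for each mixture component."""
    return [math.log(w) + gauss_logpdf(pt, m, c) for w, m, c in model]

def logsumexp(terms):
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def mixture_logpdf(pt, model):
    return logsumexp(component_terms(pt, model))

def fit_gmm(points, k=2, iters=60, reg=1e-6):
    """Batch EM for a 2-D Gaussian mixture (assumption: not the paper's recursive variant)."""
    n = len(points)
    by_x = sorted(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    vx = sum((p[0] - mx) ** 2 for p in points) / n + reg
    vy = sum((p[1] - my) ** 2 for p in points) / n + reg
    # Start each component at an evenly spaced quantile of the x-sorted data.
    model = [(1.0 / k, by_x[int((j + 0.5) * n / k)], (vx, 0.0, vy)) for j in range(k)]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            terms = component_terms(p, model)
            norm = logsumexp(terms)
            resp.append([math.exp(t - norm) for t in terms])
        # M-step: re-estimate weights, means and covariances.
        model = []
        for j in range(k):
            nk = sum(r[j] for r in resp) + 1e-12
            m0 = sum(r[j] * p[0] for r, p in zip(resp, points)) / nk
            m1 = sum(r[j] * p[1] for r, p in zip(resp, points)) / nk
            sxx = sum(r[j] * (p[0] - m0) ** 2 for r, p in zip(resp, points)) / nk + reg
            syy = sum(r[j] * (p[1] - m1) ** 2 for r, p in zip(resp, points)) / nk + reg
            sxy = sum(r[j] * (p[0] - m0) * (p[1] - m1) for r, p in zip(resp, points)) / nk
            model.append((nk / n, (m0, m1), (sxx, sxy, syy)))
    return model

# Synthetic flow intensities (hypothetical): x is the request rate at one
# measurement point, y is the rate at a downstream point that tracks 0.5 * x.
rng = random.Random(0)

def sample(n, load_mean, load_sd, noise_sd):
    pts = []
    for _ in range(n):
        x = rng.gauss(load_mean, load_sd)
        pts.append((x, 0.5 * x + rng.gauss(0.0, noise_sd)))
    return pts

train = sample(200, 100.0, 10.0, 2.0) + sample(200, 300.0, 20.0, 4.0)
model = fit_gmm(train)

# Assumed boundary: the 1st percentile of log-density over the training data.
scores = sorted(mixture_logpdf(p, model) for p in train)
threshold = scores[int(0.01 * len(scores))]

def is_anomalous(pt):
    """True when the point falls outside the learned 'likely' region."""
    return mixture_logpdf(pt, model) < threshold

print(is_anomalous((100.0, 50.0)))   # point that follows the learned relationship
print(is_anomalous((300.0, 50.0)))   # downstream rate far below what the load implies
```

A percentile threshold is only one way to turn a density model into a decision boundary; an online setting such as the one described in the abstract would also need to update the mixture parameters as new measurements arrive, which this batch sketch does not attempt.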