Tracking Probabilistic Correlation of Monitoring Data for Fault Detection in Complex Systems

Authors:
Zhen Guo; Guofei Jiang; Haifeng Chen; Kenji Yoshihira
Affiliations:
New Jersey Institute of Technology; NEC Laboratories America, Princeton, NJ; NEC Laboratories America, Princeton, NJ; NEC Laboratories America, Princeton, NJ
Venue:
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Year:
2006

Citing 0
Cited 19

Exterminator: automatically correcting memory errors with high probability
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation

Efficient and Scalable Algorithms for Inferring Likely Invariants in Distributed Systems
IEEE Transactions on Knowledge and Data Engineering

Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping
IEEE Transactions on Knowledge and Data Engineering

A comparative study of pairwise regression techniques for problem determination
CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research

Monitoring multi-tier clustered systems with invariant metric relationships
Proceedings of the 2008 international workshop on Software engineering for adaptive and self-managing systems

Information-theoretic modeling for tracking the health of complex software systems
CASCON '08 Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds

Fault tolerant target tracking in sensor networks
Proceedings of the tenth ACM international symposium on Mobile ad hoc networking and computing

Ranking the importance of alerts for problem determination in large computer systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

System monitoring with metric-correlation models: problems and solutions
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

AdaptGuard: guarding adaptive systems from instability
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

Heteroscedastic models to track relationships between management metrics
IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management

SelfTalk for Dena: query language and runtime support for evaluating system behavior
ACM SIGOPS Operating Systems Review

On the use of computational geometry to detect software faults at runtime
Proceedings of the 7th international conference on Autonomic computing

A query language and runtime tool for evaluating behavior of multi-tier servers
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing

Leveraging many simple statistical models to adaptively monitor software systems
International Journal of High Performance Computing and Networking

Ranking the importance of alerts for problem determination in large computer systems
Cluster Computing

Improved background modeling for real-time spatio-temporal non-parametric moving object detection strategies
Image and Vision Computing

Workload-aware anomaly detection for Web applications
Journal of Systems and Software

Abstract

As systems grow more complex, detecting and isolating faults in them becomes extremely difficult. Although a large amount of monitoring data can be collected from such systems for fault analysis, one challenge is how to correlate the data effectively across distributed systems and observation time. As user requests flow through a distributed system, much of the internal monitoring data responds to the volume of those requests. In this paper, we use Gaussian mixture models to characterize the probabilistic correlation between flow intensities measured at multiple points. We propose a novel algorithm, derived from the Expectation-Maximization (EM) algorithm, that learns the "likely" boundary of normal data relationships; this boundary then serves as an oracle for anomaly detection. Our recursive algorithm adaptively estimates the boundary of dynamic data relationships and detects faults in real time. We tested the approach on a real system with injected faults, and the results demonstrate its feasibility.
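
To make the idea in the abstract concrete, the following is a minimal sketch of how a Gaussian mixture over a pair of flow intensities could act as a boundary for anomaly detection. It is not the authors' algorithm: it uses plain batch EM rather than the recursive variant the paper proposes, and the synthetic data, the quantile-based initialization, the function names (`fit_gmm`, `is_anomalous`), and the 99% coverage threshold are all assumptions chosen for illustration.

```python
import math
import random

# 99% quantile of a chi-square with 2 degrees of freedom: -2 * ln(1 - 0.99).
# The coverage level is an illustrative assumption, not a value from the paper.
CHI2_2DOF_99 = -2.0 * math.log(1.0 - 0.99)

def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / (a * d - b * b)

def log_gauss(p, mean, cov):
    """Log density of a 2-D Gaussian at point p."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    return -0.5 * mahalanobis_sq(p, mean, cov) - math.log(2.0 * math.pi) - 0.5 * math.log(det)

def fit_gmm(points, k=2, iters=40, reg=1e-6):
    """Plain batch EM for a k-component 2-D Gaussian mixture."""
    n = len(points)
    ordered = sorted(points)
    # Heuristic initialization (an assumption): means at evenly spaced x-quantiles.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    covs = [((1.0, 0.0), (0.0, 1.0)) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            logs = [math.log(weights[j]) + log_gauss(p, means[j], covs[j]) for j in range(k)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([q / total for q in probs])
        # M-step: re-estimate the weight, mean, and covariance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            weights[j], means[j], covs[j] = nj / n, (mx, my), ((sxx, sxy), (sxy, syy))
    return weights, means, covs

def is_anomalous(p, model, threshold=CHI2_2DOF_99):
    """Flag p if it lies outside the threshold ellipse of every component."""
    _, means, covs = model
    return all(mahalanobis_sq(p, m, c) > threshold for m, c in zip(means, covs))

# Synthetic flow intensities at two hypothetical measurement points, with two
# load regimes in which the downstream intensity tracks the upstream one.
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.gauss(100.0, 10.0)
    points.append((x, 0.5 * x + rng.gauss(0.0, 2.0)))
for _ in range(200):
    x = rng.gauss(300.0, 20.0)
    points.append((x, 0.9 * x + rng.gauss(0.0, 4.0)))

model = fit_gmm(points)
print(is_anomalous((100.0, 50.0), model))   # consistent with the low-load regime
print(is_anomalous((100.0, 150.0), model))  # breaks the learned correlation
```

In this sketch the "boundary" is the union of the per-component ellipses at a fixed Mahalanobis radius. An online variant would update the weights, means, and covariances incrementally as each new observation arrives instead of refitting on the full batch.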