Failure detection and localization in component based systems by online tracking

Authors:
Haifeng Chen;Guofei Jiang;Cristian Ungureanu;Kenji Yoshihira
Affiliations:
NEC Laboratories America, Inc., Princeton, NJ;NEC Laboratories America, Inc., Princeton, NJ;NEC Laboratories America, Inc., Princeton, NJ;NEC Laboratories America, Inc., Princeton, NJ
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Abstract

The increasing complexity of today's systems makes fast and accurate failure detection essential for their use in mission-critical applications. Various monitoring methods provide a large amount of data about a system's behavior. Analyzing this data with advanced statistical methods holds the promise of not only detecting errors faster, but also detecting errors that are difficult to catch with current monitoring tools. Two challenges to building such detection tools are the high dimensionality of the observation data, which makes the models expensive to apply, and frequent system changes, which make the models expensive to update. In this paper, we present algorithms that reduce the dimensionality of the data in a way that makes it easy to adapt to system changes. We decompose the observation data into signal and noise subspaces. Two statistics, the Hotelling T^2 score and the squared prediction error (SPE), are calculated to represent the data characteristics in the signal and noise subspaces, respectively. Instead of tracking the original data, we use a sequentially discounting expectation maximization (SDEM) algorithm to learn the distribution of the two extracted statistics. A failure event can then be detected from an abnormal change in this distribution. Applying our technique to component interaction data in a simple e-commerce application shows better accuracy than building independent profiles for each component. Additionally, experiments on synthetic data show that the detection accuracy is high even for changing systems.
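
To make the two statistics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the signal subspace is spanned by the top k principal directions of the sample covariance (the abstract does not state how the subspaces are chosen), and it finds those directions with plain power iteration so that it needs no external libraries. The names `fit_subspace` and `t2_and_spe` are hypothetical. The Hotelling T^2 score is computed as the sum of squared projections onto the signal directions, each divided by its eigenvalue, and the SPE as the squared norm of what is left after removing those projections.

```python
import math


def fit_subspace(rows, k, iters=200):
    """Estimate the mean and the top-k (eigenvalue, unit eigenvector) pairs.

    Assumption: the signal subspace is the span of the k leading principal
    directions, and k does not exceed the rank of the covariance.
    """
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    pairs = []
    for _ in range(k):
        # Power iteration from a fixed, non-symmetric start vector.
        v = [float(j + 1) for j in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        pairs.append((lam, v))
        # Deflate so the next pass finds the next direction.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
    return mu, pairs


def t2_and_spe(x, mu, pairs):
    """Return (Hotelling T^2, SPE) for one observation x."""
    xc = [a - b for a, b in zip(x, mu)]
    residual = xc[:]
    t2 = 0.0
    for lam, v in pairs:
        score = sum(a * b for a, b in zip(xc, v))  # projection on v
        t2 += score * score / lam                  # variance-scaled
        residual = [r - score * vi for r, vi in zip(residual, v)]
    spe = sum(r * r for r in residual)             # energy in noise subspace
    return t2, spe
```

In this sketch, an observation that moves along the learned directions leaves the SPE near zero and changes only the T^2 score, while one that departs from them raises the SPE. The pair (T^2, SPE) is the low-dimensional quantity whose distribution would then be tracked online, for example by SDEM as the abstract describes.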