Adaptive system anomaly prediction for large-scale hosting infrastructures

Authors:
Yongmin Tan; Xiaohui Gu; Haixun Wang
Affiliations:
North Carolina State University, Raleigh, NC, USA; North Carolina State University, Raleigh, NC, USA; Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Year:
2010

Abstract

Large-scale hosting infrastructures require automatic system anomaly management to achieve continuous system operation. In this paper, we present a novel adaptive runtime anomaly prediction system, called ALERT, to achieve robust hosting infrastructures. In contrast to traditional anomaly detection schemes, ALERT aims to raise advance anomaly alerts so that anomalies can be prevented just in time. We propose a novel context-aware anomaly prediction scheme to improve prediction accuracy in dynamic hosting infrastructures. We have implemented the ALERT system and deployed it on several production hosting infrastructures, such as the IBM System S stream processing cluster and PlanetLab. Our experiments show that ALERT can achieve high prediction accuracy for a range of system anomalies and impose low overhead on the hosting infrastructure.