KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining data streams under block evolution
ACM SIGKDD Explorations Newsletter
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Time Series Analysis, Forecasting and Control
Time Series Analysis, Forecasting and Control
Online Data Mining for Co-Evolving Time Sequences
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
StatStream: statistical monitoring of thousands of data streams in real time
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Detecting change in data streams
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Anomaly detection algorithms in logs of process aware systems
Proceedings of the 2008 ACM symposium on Applied computing
Data Streaming with Affinity Propagation
ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II
ACM Computing Surveys (CSUR)
Fraud detection in process aware systems
Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web
Active learning and subspace clustering for anomaly detection
Intelligent Data Analysis
Online outlier detection for data streams
Proceedings of the 15th Symposium on International Database Engineering & Applications
Review: A review of novelty detection
Signal Processing
Research issues in outlier detection for data streams
ACM SIGKDD Explorations Newsletter
Hi-index | 0.00 |
We consider the problem of detecting anomalies in data that arise as multidimensional arrays with each dimension corresponding to the levels of a categorical variable. In typical data mining applications, the number of cells in such arrays are usually large. Our primary focus is detecting anomalies by comparing information at the current time to historical data. Naive approaches advocated in the process control literature do not work well in this scenario due to the multiple testing problem - performing multiple statistical tests on the same data produce excessive number of false positives. We use an Empirical Bayes method which works by fitting a two component gaussian mixture to deviations at current time. The approach is scalable to problems that involve monitoring massive number of cells and fast enough to be potentially useful in many streaming scenarios. We show the superiority of the method relative to a naive "per component error rate" procedure through simulation. A novel feature of our technique is the ability to suppress deviations that are merely the consequence of sharp changes in the marginal distributions. This research was motivated by the need to extract critical application information and business intelligence from the daily logs that accompany large-scale spoken dialog systems deployed by AT&T. We illustrate our method on one such system.