An Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional Arrays

Authors:
Deepak Agarwal
Affiliations:
AT&T Labs-Research
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005


Abstract

We consider the problem of detecting anomalies in data that arise as multidimensional arrays, with each dimension corresponding to the levels of a categorical variable. In typical data mining applications, the number of cells in such arrays is usually large. Our primary focus is detecting anomalies by comparing information at the current time to historical data. Naive approaches advocated in the process control literature do not work well in this scenario due to the multiple testing problem: performing multiple statistical tests on the same data produces an excessive number of false positives. We use an Empirical Bayes method which works by fitting a two-component Gaussian mixture to deviations at the current time. The approach is scalable to problems that involve monitoring a massive number of cells and is fast enough to be potentially useful in many streaming scenarios. We show the superiority of the method relative to a naive "per component error rate" procedure through simulation. A novel feature of our technique is the ability to suppress deviations that are merely the consequence of sharp changes in the marginal distributions. This research was motivated by the need to extract critical application information and business intelligence from the daily logs that accompany large-scale spoken dialog systems deployed by AT&T. We illustrate our method on one such system.
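
The contrast between per-cell testing and a two-component mixture can be illustrated with a small simulation. The sketch below is not the paper's model: it assumes standardized deviations, a zero-mean "null" component and a wider zero-mean "anomalous" component, and fits both by plain EM. The function names, the simulated data, the 1.96 cutoff, and the 0.5 posterior threshold are all illustrative assumptions. It does not attempt the adjustment for changes in the marginal distributions described in the abstract.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(devs, n_iter=100):
    """EM for (1 - p1) * N(0, s0^2) + p1 * N(0, s1^2), a hypothetical scale mixture.

    Returns (p1, s0, s1, post), where post[i] is the posterior probability
    (from the last E-step) that cell i belongs to the wider component.
    """
    # Robust starting scale for the null component: median |d| / 0.6745.
    abs_sorted = sorted(abs(d) for d in devs)
    s0 = max(abs_sorted[len(abs_sorted) // 2] / 0.6745, 1e-6)
    s1 = 3.0 * s0   # assumed starting point: anomalies are more spread out
    p1 = 0.05       # assumed starting fraction of anomalous cells
    post = [0.0] * len(devs)
    for _ in range(n_iter):
        # E-step: posterior probability that each cell is anomalous.
        for i, d in enumerate(devs):
            a = p1 * normal_pdf(d, 0.0, s1)
            b = (1.0 - p1) * normal_pdf(d, 0.0, s0)
            post[i] = a / (a + b)
        # M-step: re-estimate the mixing weight and the two scales.
        w1 = sum(post)
        w0 = len(devs) - w1
        p1 = w1 / len(devs)
        s1 = max(math.sqrt(sum(r * d * d for r, d in zip(post, devs)) / w1), 1e-6)
        s0 = max(math.sqrt(sum((1.0 - r) * d * d for r, d in zip(post, devs)) / w0), 1e-6)
    return p1, s0, s1, post


# Simulated deviations (made-up data): 5000 cells, the first 50 truly anomalous.
random.seed(0)
n_cells, n_anom = 5000, 50
is_anom = [i < n_anom for i in range(n_cells)]
devs = [random.gauss(0.0, 6.0 if a else 1.0) for a in is_anom]

p1, s0, s1, post = fit_two_component_mixture(devs)

# Naive rule: test every cell separately at the two-sided 5% level.
naive_flags = [abs(d) > 1.96 for d in devs]
# Mixture rule: flag cells that are more likely anomalous than not.
eb_flags = [r > 0.5 for r in post]

naive_fp = sum(f and not a for f, a in zip(naive_flags, is_anom))
eb_fp = sum(f and not a for f, a in zip(eb_flags, is_anom))
eb_tp = sum(f and a for f, a in zip(eb_flags, is_anom))

print("fitted p1, s0, s1:", round(p1, 4), round(s0, 3), round(s1, 3))
print("naive false positives:", naive_fp)
print("mixture false positives:", eb_fp, "true positives:", eb_tp)
```

With roughly 5000 null cells, a per-cell 5% test is expected to flag a few hundred of them by chance alone. The mixture shares information across cells through the estimated anomaly fraction, which raises the bar a deviation must clear before it is flagged.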