The hunting of the bump: on maximizing statistical discrepancy

  • Authors:
  • Deepak Agarwal; Jeff M. Phillips; Suresh Venkatasubramanian

  • Affiliations:
  • AT&T Labs - Research; Duke University; AT&T Labs - Research

  • Venue:
  • SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
  • Year:
  • 2006

Abstract

Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not reveal true anomalies: such clusters may well be clusters of the baseline distribution itself. Hence, a discrepancy function is often used to quantify how different the measured data is from the baseline data within a region. An anomalous region is then defined as one with high discrepancy.

In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only to maximize a linear discrepancy function over the same class of shapes.

We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in time O((1/ε) n^2 log^2 n) that computes the maximum discrepancy rectangle to within additive error ε for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).
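
To make the object being maximized concrete, here is a minimal Python sketch, not the paper's algorithm: it evaluates the likelihood-ratio form of the Kulldorff discrepancy, d(m_R, b_R) = m_R ln(m_R/b_R) + (1 - m_R) ln((1 - m_R)/(1 - b_R)), taken to be 0 when m_R <= b_R, and maximizes it by brute force over all axis-parallel rectangles spanned by the input points. The function names and the enumeration strategy are illustrative assumptions; the brute force (roughly n^4 rectangles, each scored in O(n) time) is slower even than the prior exact O(n^4) methods the paper improves on, and is only meant as a readable baseline.

    # Minimal sketch (assumed names, not the authors' code): Kulldorff-style
    # discrepancy for Poisson data, maximized naively over axis-parallel rectangles.
    import math
    from itertools import combinations

    def kulldorff_discrepancy(m, b):
        # m: measured fraction in the region, b: baseline fraction in the region.
        # Returns 0 unless the measured fraction exceeds the baseline fraction.
        if m <= b or m <= 0.0 or m >= 1.0 or b <= 0.0 or b >= 1.0:
            return 0.0
        return m * math.log(m / b) + (1.0 - m) * math.log((1.0 - m) / (1.0 - b))

    def max_discrepancy_rectangle(points, measured, baseline):
        # points: list of (x, y); measured/baseline: per-point counts.
        M, B = float(sum(measured)), float(sum(baseline))
        xs = sorted({x for x, _ in points})
        ys = sorted({y for _, y in points})
        best, best_rect = 0.0, None
        for x1, x2 in combinations(xs, 2):      # zero-width/height rectangles skipped for brevity
            for y1, y2 in combinations(ys, 2):
                inside = [i for i, (x, y) in enumerate(points)
                          if x1 <= x <= x2 and y1 <= y <= y2]
                m_r = sum(measured[i] for i in inside) / M
                b_r = sum(baseline[i] for i in inside) / B
                d = kulldorff_discrepancy(m_r, b_r)
                if d > best:
                    best, best_rect = d, (x1, y1, x2, y2)
        return best, best_rect

    # Example use on toy data: the dense measured cluster near the origin scores highest.
    pts = [(0, 0), (1, 1), (2, 2), (5, 5)]
    print(max_discrepancy_rectangle(pts, measured=[10, 12, 11, 1], baseline=[5, 5, 5, 5]))

The paper's contribution is to replace this kind of exhaustive search: because the discrepancy function is convex, its maximum over a shape class can be found by maximizing linear discrepancy functions over the same class, yielding the approximation guarantees stated above.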