Spatial scan statistics: approximations and performance study

Authors:
Deepak Agarwal;Andrew McGregor;Jeff M. Phillips;Suresh Venkatasubramanian;Zhengyuan Zhu
Affiliations:
Yahoo! Research;University of Pennsylvania;Duke University;AT&T Labs - Research;University of North Carolina
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such "high discrepancy" regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality.

In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus run time trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off.

All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
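To make the discrepancy maximization problem in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the usual form of the Kulldorff discrepancy, d(m, b) = m log(m/b) + (1 - m) log((1 - m)/(1 - b)) when m > b and 0 otherwise, where m and b are the fractions of measured and baseline mass falling inside a region. It also assumes that regions are axis-parallel rectangles on a grid of cell counts, which the abstract does not state. The function names (`kulldorff`, `prefix_sums`, `rect_sum`, `max_discrepancy_rectangle`) are illustrative only. The search is a plain brute force over all rectangles using 2D prefix sums, so it examines O(g^4) rectangles on a g x g grid.

```python
import math


def kulldorff(m, b):
    """Kulldorff discrepancy for measured fraction m and baseline fraction b.

    Assumed form: m*log(m/b) + (1-m)*log((1-m)/(1-b)) if m > b, else 0.
    Regions with no baseline mass are treated as degenerate and score 0
    (a simplifying assumption of this sketch).
    """
    if m <= b or b <= 0.0:
        return 0.0
    value = m * math.log(m / b)
    if m < 1.0:  # the second term tends to 0 as m -> 1
        value += (1.0 - m) * math.log((1.0 - m) / (1.0 - b))
    return value


def prefix_sums(grid):
    """2D prefix sums: P[i][j] is the sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    P = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P


def rect_sum(P, r0, c0, r1, c1):
    """Sum over rows r0..r1 and columns c0..c1 (inclusive) in O(1)."""
    return P[r1 + 1][c1 + 1] - P[r0][c1 + 1] - P[r1 + 1][c0] + P[r0][c0]


def max_discrepancy_rectangle(measured, baseline):
    """Brute-force search over every axis-parallel grid rectangle.

    Returns (best_score, (r0, c0, r1, c1)); the region is None if no
    rectangle has positive discrepancy.
    """
    rows, cols = len(measured), len(measured[0])
    M, B = prefix_sums(measured), prefix_sums(baseline)
    total_m, total_b = M[rows][cols], B[rows][cols]
    best_score, best_region = 0.0, None
    for r0 in range(rows):
        for r1 in range(r0, rows):
            for c0 in range(cols):
                for c1 in range(c0, cols):
                    m = rect_sum(M, r0, c0, r1, c1) / total_m
                    b = rect_sum(B, r0, c0, r1, c1) / total_b
                    score = kulldorff(m, b)
                    if score > best_score:
                        best_score, best_region = score, (r0, c0, r1, c1)
    return best_score, best_region


# Toy data: uniform baseline, with measured counts elevated in the center cell.
baseline = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
measured = [[1, 1, 1], [1, 10, 1], [1, 1, 1]]
score, region = max_discrepancy_rectangle(measured, baseline)
print(region, round(score, 3))
```

On this toy input the search should single out the center cell as the hotspot. The quartic enumeration is only practical for small grids, which is the kind of cost that motivates the approximation algorithms described in the abstract.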