Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such "high discrepancy" regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality.

In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus run time trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets that lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off.

All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scales well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithm that provides even approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
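To make the discrepancy maximization problem concrete, the following is a minimal illustrative sketch, not the exact or approximation algorithm of the paper. It assumes the commonly used Poisson form of the Kulldorff scan statistic, d(m, b) = m ln(m/b) + (1 - m) ln((1 - m)/(1 - b)) for m > b and 0 otherwise, where m and b are the fractions of measured and baseline data inside a region. It also assumes gridded data and axis-aligned rectangles as the region family, and it searches them by brute force with prefix sums. All function names are hypothetical.

```python
import math


def kulldorff(m, b):
    """Assumed Poisson form of the Kulldorff discrepancy.

    m, b: fractions of measured and baseline data inside the region.
    Convention chosen here (an assumption): regions with no excess of
    measured data, or with zero baseline, score 0.
    """
    if m <= b or b <= 0.0:
        return 0.0
    if m >= 1.0:
        # Limit of the formula as m -> 1 (the second term vanishes).
        return -math.log(b)
    return m * math.log(m / b) + (1 - m) * math.log((1 - m) / (1 - b))


def prefix_sums(grid):
    """2D prefix sums so any rectangle sum costs O(1)."""
    rows, cols = len(grid), len(grid[0])
    p = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            p[i + 1][j + 1] = grid[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p


def rect_sum(p, r0, c0, r1, c1):
    """Sum over rows r0..r1 and columns c0..c1 (inclusive)."""
    return p[r1 + 1][c1 + 1] - p[r0][c1 + 1] - p[r1 + 1][c0] + p[r0][c0]


def max_discrepancy_rectangle(measured, baseline):
    """Brute-force search over all axis-aligned grid rectangles.

    Returns (best discrepancy, (r0, c0, r1, c1)). For an n x n grid this
    examines O(n^4) rectangles, so it is only practical for small grids.
    """
    rows, cols = len(measured), len(measured[0])
    pm, pb = prefix_sums(measured), prefix_sums(baseline)
    total_m, total_b = pm[rows][cols], pb[rows][cols]
    best_d, best_rect = 0.0, None
    for r0 in range(rows):
        for r1 in range(r0, rows):
            for c0 in range(cols):
                for c1 in range(c0, cols):
                    m = rect_sum(pm, r0, c0, r1, c1) / total_m
                    b = rect_sum(pb, r0, c0, r1, c1) / total_b
                    d = kulldorff(m, b)
                    if d > best_d:
                        best_d, best_rect = d, (r0, c0, r1, c1)
    return best_d, best_rect


# Toy data: uniform baseline with one cell holding an excess of cases.
measured = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
baseline = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
score, rect = max_discrepancy_rectangle(measured, baseline)
print(score, rect)
```

The brute-force search touches every cell's data and every rectangle, which illustrates why such methods suit only modestly sized inputs.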