Algorithms for speeding up distance-based outlier detection

Authors:
Kanishka Bhaduri;Bryan L. Matthews;Chris R. Giannella
Affiliations:
Mission Critical Technologies Inc | NASA Ames Research Center, Moffett Field, CA, USA;Stinger Ghaffarian Technologies | NASA Ames Research Center, Moffett Field, CA, USA;The MITRE Corp., 300 Sentinel Dr., Suite 600, Annapolis Junction, MD, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Citing 15

LOF: identifying density-based local outliers

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Efficient algorithms for mining outliers from large data sets

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Multidimensional binary search trees used for associative searching

Communications of the ACM
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Parallel Mining of Outliers in Large Database

Distributed and Parallel Databases
Fast Outlier Detection in High Dimensional Spaces

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Algorithms for Mining Distance-Based Outliers in Large Datasets

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Parallel Algorithms for Distance-Based and Density-Based Outliers

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Data Mining and Knowledge Discovery
Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery
DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets

ACM Transactions on Knowledge Discovery from Data (TKDD)
Anomaly detection: A survey

ACM Computing Surveys (CSUR)
A distributed approach to detect outliers in very large data sets

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Abstract

Distance-based outlier detection is difficult to solve efficiently on very large datasets because its time complexity can be quadratic in the number of points. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm, iOrca, that is up to an order of magnitude faster than the state of the art. The indexing scheme sorts the data points in order of increasing distance from a fixed reference point and then accesses the points in this sorted order. To speed up the basic outlier detection technique further, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines connected in a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme described above. In our experiments, both distributed algorithms exhibit significant improvements over the state-of-the-art distributed method [13].
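
The two ideas the abstract combines, a cutoff threshold for pruning and an index built by sorting points on their distance to a fixed reference point, can be illustrated with a small Python sketch. This is not the authors' iOrca implementation. It assumes that the outlier score of a point is the distance to its k-th nearest neighbor and that the top n scores are wanted; the abstract specifies neither. The function names (`top_outliers_indexed`, `spiral`, `knn_score_brute`) and the choice of the first point as the reference point are hypothetical, and the disk block accessing techniques are not modeled.

```python
import heapq
import math


def knn_score_brute(points, i, k):
    """Reference score: distance from points[i] to its k-th nearest neighbor."""
    d = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return d[k - 1]


def spiral(pos, m):
    """Yield index positions around pos, nearest positions first."""
    for step in range(1, m):
        if pos - step >= 0:
            yield pos - step
        if pos + step < m:
            yield pos + step


def top_outliers_indexed(points, k, n, ref=None):
    """Return the n (score, index) pairs with the largest k-NN distance.

    Assumption: points are visited in order of increasing distance from
    a fixed reference point, and neighbors are searched outward from a
    point's own position in that order.
    """
    if ref is None:
        ref = points[0]  # hypothetical choice of reference point
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], ref))

    top = []      # min-heap of (score, index); weakest current outlier on top
    cutoff = 0.0  # score a point must exceed to enter the top n

    for pos, i in enumerate(order):
        knn = []  # negated distances: a max-heap of the k smallest so far
        pruned = False
        for j in spiral(pos, len(order)):
            d = math.dist(points[i], points[order[j]])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # The current k-th distance is an upper bound on the final
            # score, so falling below the cutoff rules the point out.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned or len(knn) < k:
            continue
        score = -knn[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

Because a point is discarded only when an upper bound on its score is already below the cutoff, the sketch returns the same top-n scores as a brute-force scan. The sorted order only changes how quickly near neighbors are found, and therefore how early a point can be pruned.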