Finding Local Anomalies in Very High Dimensional Space

Authors:
Timothy de Vries;Sanjay Chawla;Michael E. Houle
Venue:
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Year:
2010

Cited by 5:

A survey on unsupervised outlier detection in high-dimensional numerical data (Statistical Analysis and Data Mining)
Local anomaly descriptor: a robust unsupervised algorithm for anomaly detection based on diffusion space (Proceedings of the 21st ACM international conference on Information and knowledge management)
Finding the most descriptive substructures in graphs with discrete and numeric labels (NFMCP'12 Proceedings of the First international conference on New Frontiers in Mining Complex Patterns)
Certainty-based active learning for sampling imbalanced datasets (Neurocomputing)
Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection (Data Mining and Knowledge Discovery)

Abstract

Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which are used as the core density measurement within LOF. The reduced dimensionality allows for efficient indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions.
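The pipeline the abstract describes can be illustrated with a small sketch: project the data with a random matrix, collect an extended candidate neighbour set in the reduced space, re-rank those candidates by their distances in the original space, and feed the resulting k-nearest-neighbour distances into LOF. This is not the authors' implementation. The function names and the parameters `candidates`, `target_dim` and `seed` are assumptions made for illustration, and a brute-force (quadratic) search stands in for the spatial index that would provide the sub-quadratic behaviour in practice.

```python
import math
import random


def random_projection(data, target_dim, seed=0):
    """Project each row of `data` to `target_dim` dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)] for row in data]


def nearest(points, query_index, count):
    """Brute-force search (assumption: stands in for an index); returns sorted (distance, index) pairs."""
    query = points[query_index]
    scored = [(math.dist(query, p), i)
              for i, p in enumerate(points) if i != query_index]
    return sorted(scored)[:count]


def pinn_style_knn(data, k, candidates, target_dim, seed=0):
    """Approximate kNN: pick `candidates` neighbours in the projected space,
    then keep the k closest of them by distance in the original space."""
    projected = random_projection(data, target_dim, seed)
    result = []
    for q in range(len(data)):
        candidate_ids = [i for _, i in nearest(projected, q, candidates)]
        rescored = sorted((math.dist(data[q], data[i]), i) for i in candidate_ids)
        result.append(rescored[:k])
    return result


def lof_scores(knn):
    """LOF from per-point neighbour lists of (distance, index) pairs.
    Assumes no duplicate points, so every reachability sum is positive."""
    k_dist = [neighbours[-1][0] for neighbours in knn]
    lrd = []
    for neighbours in knn:
        # Reachability distance to o is max(k-distance of o, distance to o).
        reach_sum = sum(max(k_dist[i], d) for d, i in neighbours)
        lrd.append(len(neighbours) / reach_sum)
    return [sum(lrd[i] for _, i in neighbours) / (len(neighbours) * lrd[q])
            for q, neighbours in enumerate(knn)]
```

A call such as `lof_scores(pinn_style_knn(data, k=5, candidates=15, target_dim=10))` returns one score per point, with values well above 1 indicating points that are less dense than their neighbours. Setting `candidates` to the number of points minus one makes the neighbour lists exact, which is a convenient way to check how much accuracy a smaller candidate set gives up.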