An Efficient Reference-Based Approach to Outlier Detection in Large Datasets

Authors:
Yaling Pei;Osmar R. Zaiane;Yong Gao
Affiliations:
University of Alberta, Canada;University of Alberta, Canada;University of British Columbia Okanagan, Canada
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Citing: 0
Cited by: 6

Angle-based outlier detection in high-dimensional data
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Outlier Detection Based on Voronoi Diagram
ADMA '08 Proceedings of the 4th international conference on Advanced Data Mining and Applications

NADO: network anomaly detection using outlier approach
Proceedings of the 2011 International Conference on Communication, Computing & Security

Visual evaluation of outlier detection models
DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part II

A survey on unsupervised outlier detection in high-dimensional numerical data
Statistical Analysis and Data Mining

Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection
Data Mining and Knowledge Discovery

Abstract

A bottleneck in detecting distance-based and density-based outliers is that a nearest-neighbor search is required for each data point, resulting in a quadratic number of pairwise distance evaluations. In this paper, we propose a new method that uses the relative degree of density with respect to a fixed set of reference points to approximate the degree of density defined in terms of the nearest neighbors of a data point. The running time of our algorithm based on this approximation is O(Rn log n), where n is the size of the dataset and R is the number of reference points. Candidate outliers are ranked based on the outlier score assigned to each data point. Theoretical analysis and empirical studies show that our method is effective, efficient, and highly scalable to very large datasets.
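
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the density or score formulas, so the choices below are assumptions made for illustration. Here the relative density of a point with respect to one reference point is taken as the inverse of the mean gap to its k nearest neighbors in the sorted one-dimensional list of distances to that reference, the per-point density is the minimum over all reference points, and the score is one minus the density normalized by the maximum. The function name `reference_outlier_scores` and the parameter `k` are hypothetical.

```python
import math


def reference_outlier_scores(points, references, k):
    """Score each point in [0, 1]; higher means more outlying (assumed formula)."""
    n = len(points)
    if not references:
        raise ValueError("at least one reference point is required")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    min_density = [math.inf] * n
    for ref in references:
        # One pass of n distance evaluations, then an O(n log n) sort.
        dist = [math.dist(p, ref) for p in points]
        order = sorted(range(n), key=dist.__getitem__)
        sorted_d = [dist[i] for i in order]

        for pos, idx in enumerate(order):
            # Grow a window around pos to collect the k closest
            # neighbors in the 1-D space of distances to ref.
            left, right = pos - 1, pos + 1
            total_gap = 0.0
            for _ in range(k):
                gap_l = sorted_d[pos] - sorted_d[left] if left >= 0 else math.inf
                gap_r = sorted_d[right] - sorted_d[pos] if right < n else math.inf
                if gap_l <= gap_r:
                    total_gap += gap_l
                    left -= 1
                else:
                    total_gap += gap_r
                    right += 1
            # Small constant (an assumption) avoids division by zero on ties.
            density = 1.0 / (total_gap / k + 1e-12)
            min_density[idx] = min(min_density[idx], density)

    max_density = max(min_density)
    return [1.0 - d / max_density for d in min_density]


# Toy data: a 5 x 5 grid cluster near the origin plus one distant point.
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)] + [(5.0, 5.0)]
references = [(0.0, 0.0), (0.0, 5.0), (5.0, 0.0)]
scores = reference_outlier_scores(points, references, k=3)
top = max(range(len(points)), key=scores.__getitem__)
```

In this sketch each reference point costs one sort of n distances, which is where an O(Rn log n) term comes from; the window scan adds O(nk) work per reference point. Taking the minimum over reference points matters in the toy data: the distant point lies at the same distance from (0.0, 5.0) as part of the cluster, so that reference alone would not separate it, while (0.0, 0.0) does.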