Mining Outliers with Faster Cutoff Update and Space Utilization

Authors:
Chi-Cheong Szeto;Edward Hung
Affiliations:
Department of Computing, Hong Kong Polytechnic University;Department of Computing, Hong Kong Polytechnic University
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

References:

Efficient algorithms for mining outliers from large data sets
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Algorithms for Mining Distance-Based Outliers in Large Datasets
VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases

Mining distance-based outliers in near linear time with randomization and a simple pruning rule
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Mining distance-based outliers from large databases in any metric space
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Very efficient mining of distance-based outliers
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Fast mining of distance-based outliers in high-dimensional datasets
Data Mining and Knowledge Discovery

Cited by:

Mining Outliers with Adaptive Cutoff Update and Space Utilization (RACAS)
Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence

Abstract

Ramaswamy et al.'s distance-based outlier definition is attractive for finding unusual data objects because it requires only a metric distance function between two objects. Unlike many existing algorithms based on the definition of Knorr and Ng, it does not need a neighborhood distance threshold. Bay and Schwabacher proposed ORCA, an efficient algorithm for this task that can give near-linear time performance. To further reduce the running time, we propose two algorithms, RC and RS, which use the following techniques respectively: (i) faster cutoff update and (ii) space utilization after pruning. We tested RC, RS, and RCS (a hybrid combining RC and RS) on several large, high-dimensional real data sets with millions of objects. The experiments show that RCS runs 1.4 to 2.3 times as fast as ORCA and that its improvement is relatively insensitive to increases in data size.
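
For orientation, the baseline idea the abstract builds on can be sketched as a nested loop with a cutoff. This is a minimal illustration, not the RC, RS, or RCS algorithms and not ORCA's actual implementation. It assumes each object is scored by the distance to its k-th nearest neighbor, as in Ramaswamy et al.'s definition, and that the top n scores are wanted. The names `top_n_outliers`, `k`, `n`, and `euclidean` are hypothetical and chosen only for this example.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance; any metric distance function could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_outliers(points, k, n, dist=euclidean, seed=0):
    """Return (score, index) pairs for the n points whose k-th nearest
    neighbor is farthest away, sorted by decreasing score."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random scan order helps pruning trigger early
    top = []      # min-heap of (score, index); top[0] is the weakest outlier kept
    cutoff = 0.0  # score of the weakest kept outlier; stays 0 until n are found
    for i in order:
        knn = []  # negated distances: a max-heap over the k smallest seen so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = dist(points[i], points[j])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # Once k neighbors are known, -knn[0] is an upper bound on the final
            # score. If it is already below the cutoff, point i cannot be a top-n outlier.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned or len(knn) < k:
            continue
        score = -knn[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]  # the cutoff only ever grows, so pruning gets stronger
    return sorted(top, reverse=True)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (-8, 7)]
    for score, index in top_n_outliers(data, k=2, n=2):
        print(index, round(score, 2))
```

In this sketch the cutoff rises only when a new object enters the top n, so how quickly it rises determines how early the inner loop can stop. That is the quantity the abstract's "faster cutoff update" refers to, although the sketch does not attempt to reproduce how RC achieves it.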