Efficient algorithms for mining outliers from large data sets
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Algorithms for Mining Distance-Based Outliers in Large Datasets
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Mining distance-based outliers in near linear time with randomization and a simple pruning rule
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining distance-based outliers from large databases in any metric space
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Very efficient mining of distance-based outliers
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Fast mining of distance-based outliers in high-dimensional datasets
Data Mining and Knowledge Discovery
Mining Outliers with Adaptive Cutoff Update and Space Utilization (RACAS)
Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence
Hi-index | 0.00 |
It is desirable to find unusual data objects by Ramaswamy et al's distance-based outlier definition because only a metric distance function between two objects is required. It does not need any neighborhood distance threshold required by many existing algorithms based on the definition of Knorr and Ng. Bay and Schwabacher proposed an efficient algorithm ORCA, which can give near linear time performance, for this task. To further reduce the running time, we propose in this paper two algorithms RC and RS using the following two techniques respectively: (i) faster cutoff update, and (ii) space utilization after pruning. We tested RC, RS and RCS (a hybrid approach combining both RC and RS) on several large and high-dimensional real data sets with millions of objects. The experiments show that the speed of RCS is as fast as 1.4 to 2.3 times that of ORCA, and the improvement of RCS is relatively insensitive to the increase in the data size.