Very efficient mining of distance-based outliers

Authors:
Fabrizio Angiulli;Fabio Fassetti
Affiliations:
Università della Calabria, Rende (CS), Italy;Università della Calabria, Rende (CS), Italy
Venue:
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Year:
2007

Citing 17
Cited 5

The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
LOF: identifying density-based local outliers

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Efficient algorithms for mining outliers from large data sets

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Multidimensional binary search trees used for associative searching

Communications of the ACM
Outlier detection for high dimensional data

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Mining top-n local outliers in large databases

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Searching in metric spaces

ACM Computing Surveys (CSUR)
Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases

ACM Computing Surveys (CSUR)
Fast Outlier Detection in High Dimensional Spaces

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Algorithms for Mining Distance-Based Outliers in Large Datasets

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Finding Intensional Knowledge of Distance-Based Outliers

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Distance-based outliers: algorithms and applications

The VLDB Journal — The International Journal on Very Large Data Bases
Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Outlier Mining in Large High-Dimensional Data Sets

IEEE Transactions on Knowledge and Data Engineering
Distance-Based Detection and Prediction of Outliers

IEEE Transactions on Knowledge and Data Engineering
Mining distance-based outliers from large databases in any metric space

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Disk aware discord discovery: finding unusual time series in terabyte sized datasets

Knowledge and Information Systems
DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets

ACM Transactions on Knowledge Discovery from Data (TKDD)
Mining Outliers with Faster Cutoff Update and Space Utilization

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Distance-based outlier queries in data streams: the novel task and algorithms

Data Mining and Knowledge Discovery
Distance-based outlier detection: consolidation and renewed bearing

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this work a novel algorithm, named DOLPHIN, for detecting distance-based outliers is presented. The proposed algorithm performs only two sequential scans of the dataset. It needs to store into main memory a portion of the dataset, to efficiently search for neighbors and early prune inliers. The strategy pursued by the algorithm allows to keep this portion very small. Both theoretical justification and empirical evidence that the size of the stored data amounts only to a few percent of the dataset are provided. Another important feature of DOLPHIN is that the memory-resident data are indexed by using a suitable proximity search approach. This allows to search for nearest neighbors looking only at a small subset of the main memory stored data. Temporal and spatial cost analysis show that the novel algorithm achieves both near linear CPU and I/O cost. DOLPHIN has been compared with state of the art methods, showing that it outperforms existing ones.