Subsampling for efficient and effective unsupervised outlier detection ensembles

Authors:
Arthur Zimek;Matthew Gaudet;Ricardo J.G.B. Campello;Jörg Sander
Affiliations:
University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample can per se, besides inducing diversity and under certain conditions, already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples further improves the results. While the literature so far has merely transferred the intuition that ensembles improve over single outlier detectors from the classification literature, here we also justify analytically why ensembles can be expected to work in the unsupervised setting of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than a single outlier detector on the complete data.
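
The subsampling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not fix a base detector or a score combination rule, so the choices here are assumptions made only for illustration: the base detector is the distance to the k-th nearest neighbor, ensemble members are combined by averaging their scores, and the function names and parameters (`kth_nn_distance`, `subsample_ensemble_scores`, `sample_rate`, `n_members`) are hypothetical. The sketch also assumes the dataset has more than k points.

```python
import math
import random


def kth_nn_distance(point_index, data, sample_indices, k):
    """Distance from data[point_index] to its k-th nearest neighbor in the subsample."""
    dists = sorted(
        math.dist(data[point_index], data[j])
        for j in sample_indices
        if j != point_index  # a point is never counted as its own neighbor
    )
    return dists[k - 1]


def subsample_ensemble_scores(data, k=5, sample_rate=0.1, n_members=10, seed=0):
    """Average k-NN-distance outlier scores over several random subsamples.

    Assumed setup (not taken from the paper): every point in the full
    dataset is scored against each subsample, and member scores are
    combined by a plain mean. Higher scores mean "more outlying".
    """
    rng = random.Random(seed)
    n = len(data)
    # Subsample size: at least k + 1, so that every point still has
    # k neighbors in the subsample besides itself.
    m = min(n, max(k + 1, round(sample_rate * n)))
    totals = [0.0] * n
    for _ in range(n_members):
        sample = rng.sample(range(n), m)  # drawn without replacement
        for i in range(n):
            totals[i] += kth_nn_distance(i, data, sample, k)
    return [total / n_members for total in totals]
```

In this sketch each ensemble member computes n × m distances instead of the roughly n × n needed on the complete data. With a sample rate of 0.1, for example, an ensemble of ten members performs about as many distance computations as one detector on the full dataset, and a smaller ensemble performs fewer, which is the kind of trade-off between sample rate and ensemble size the abstract refers to.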