A survey on unsupervised outlier detection in high-dimensional numerical data

Authors:
Arthur Zimek;Erich Schubert;Hans-Peter Kriegel
Affiliations:
Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2E8;Institute for Informatics, Ludwig-Maximilians Universität München, Germany;Institute for Informatics, Ludwig-Maximilians Universität München, Germany
Venue:
Statistical Analysis and Data Mining
Year:
2012

Abstract

High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’; more concrete aspects are the so-called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, namely those that consider subspaces (subsets of attributes) for the definition of outliers and those that do not. The former specifically address the presence of irrelevant attributes; the latter consider the presence of irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this survey article, we discuss some important aspects of the ‘curse of dimensionality’ in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
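
The ‘distance concentration effect’ named in the abstract can be illustrated with a small experiment. The following is a minimal sketch, not taken from the article: it assumes independent, uniformly distributed attributes, and the function name `relative_contrast` and its parameters are hypothetical. It measures the relative contrast (d_max - d_min) / d_min between the farthest and nearest neighbor of a random query point, which tends to shrink as the dimensionality grows under these assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Relative contrast (d_max - d_min) / d_min of Euclidean distances
    from one random query point to n_points random points in [0, 1]^dim.

    Assumption: attributes are i.i.d. uniform, a setting in which
    distances are known to concentrate as dim grows.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for increasing dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

In this setting the printed contrast should fall steadily with the dimensionality, meaning the nearest and farthest neighbors become hard to tell apart by distance alone. Data with correlated or irrelevant attributes may behave differently, which is the kind of nuance the survey discusses.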