Can shared-neighbor distances defeat the curse of dimensionality?

Authors:
Michael E. Houle;Hans-Peter Kriegel;Peer Kröger;Erich Schubert;Arthur Zimek
Affiliations:
National Institute of Informatics, Tokyo, Japan;Ludwig-Maximilians-Universität München, München, Germany;Ludwig-Maximilians-Universität München, München, Germany;Ludwig-Maximilians-Universität München, München, Germany;Ludwig-Maximilians-Universität München, München, Germany
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Abstract

The performance of similarity measures for search, indexing, and data mining applications tends to degrade rapidly as the dimensionality of the data increases. The effects of the so-called 'curse of dimensionality' have been studied by researchers for data sets generated according to a single data distribution. In this paper, we study the effects of this phenomenon on different similarity measures for multiply-distributed data. In particular, we assess the performance of shared-neighbor similarity measures, which are secondary similarity measures based on the rankings of data objects induced by some primary distance measure. We find that rank-based similarity measures can result in more stable performance than their associated primary distance measures.
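To make the idea of a secondary, rank-based similarity concrete, the following is a minimal sketch of one common shared-neighbor formulation: the similarity of two objects is the number of objects that appear in both of their k-nearest-neighbor lists under the primary distance. The abstract does not give the paper's exact definitions, so the Euclidean primary distance, the exclusion of each object from its own neighbor list, the normalization by k, and the function names (`knn_indices`, `snn_similarity`, `snn_distance`) are all assumptions made for illustration.

```python
import math


def knn_indices(points, i, k):
    """Return the set of indices of the k nearest neighbors of points[i].

    Assumption: the primary distance is Euclidean and a point is not
    counted as its own neighbor. Ties are broken by index.
    """
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return set(others[:k])


def snn_similarity(points, i, j, k):
    """Count the neighbors shared by the k-NN lists of points i and j."""
    return len(knn_indices(points, i, k) & knn_indices(points, j, k))


def snn_distance(points, i, j, k):
    """Turn the overlap into a dissimilarity in [0, 1] (illustrative normalization)."""
    return 1.0 - snn_similarity(points, i, j, k) / k


# Two small, well-separated groups of 2-D points (made-up toy data).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]

same_group = snn_similarity(points, 0, 1, k=2)
different_group = snn_similarity(points, 0, 3, k=2)
```

In this toy data, objects from the same group share part of their neighbor lists while objects from different groups share none. The secondary measure depends only on the ranking induced by the primary distance, not on the distance values themselves.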