On the Surprising Behavior of Distance Metrics in High Dimensional Spaces

Authors:
Charu C. Aggarwal;Alexander Hinneburg;Daniel A. Keim
Venue:
ICDT '01 Proceedings of the 8th International Conference on Database Theory
Year:
2001

Citing 10
Cited 82

The SR-tree: an index structure for high-dimensional nearest neighbor queries

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
A cost model for nearest neighbor search in high-dimensional data space

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
The pyramid-technique: towards breaking the curse of dimensionality

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Density-based indexing for approximate nearest-neighbor queries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
The TV-tree: an index structure for high-dimensional data

The VLDB Journal — The International Journal on Very Large Data Bases - Spatial Database Systems
Fast Nearest Neighbor Search in High-Dimensional Space

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
When Is "Nearest Neighbor" Meaningful?

ICDT '99 Proceedings of the 7th International Conference on Database Theory
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
What Is the Nearest Neighbor in High Dimensional Spaces?

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Towards Exploring Interactive Relationship between Clusters and Outliers in Multi-Dimensional Data Analysis

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms

SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Approximate Clustering of Noisy Biomedical Data

ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Boosting the Immune System

ICARIS '08 Proceedings of the 7th international conference on Artificial Immune Systems
Easing the Dimensionality Curse by Stretching Metric Spaces

SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management
Subspace sums for extracting non-random data from massive noise

Knowledge and Information Systems
On the effects of dimensionality on data analysis with neural networks

IWANN '03 Proceedings of the 7th International Work-Conference on Artificial and Natural Neural Networks: Part II: Artificial Neural Nets Problem Solving Methods
Is the Distance Compression Effect Overstated? Some Theory and Experimentation

MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Efficient Clustering of Web-Derived Data Sets

MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Indexing the Function: An Efficient Algorithm for Multi-dimensional Search with Expensive Distance Functions

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Subspace and projected clustering: experimental evaluation and analysis

Knowledge and Information Systems
Detecting New Kinds of Patient Safety Incidents

DS '09 Proceedings of the 12th International Conference on Discovery Science
Shape-Based Autotagging of 3D Models for Retrieval

SAMT '09 Proceedings of the 4th International Conference on Semantic and Digital Media Technologies: Semantic Multimedia
Subspace methods for retrieval of general 3D models

Computer Vision and Image Understanding
A network-based model for high-dimensional information filtering

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
On the efficient computation of robust regression estimators

Computational Statistics & Data Analysis
Automatic configuration of spectral dimensionality reduction methods

Pattern Recognition Letters
CP-index: using clustering and pivots for indexing non-metric spaces

Proceedings of the Third International Conference on SImilarity Search and APplications
Metric spaces in data mining: applications to clustering

SIGSPATIAL Special
Can shared-neighbor distances defeat the curse of dimensionality?

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Subspace similarity search: efficient k-NN queries in arbitrary subspaces

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
On the impact of the metrics choice in SOM learning: some empirical results from financial data

KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part III
Techniques for power reduction in an SIMD implementation of the VQ/SOM algorithms

Neurocomputing
Towards improving a similarity search approach

Proceedings of the 48th Annual Southeast Regional Conference
A unifying criterion for unsupervised clustering and feature selection

Pattern Recognition
On (not) indexing quadratic form distance by metric access methods

Proceedings of the 14th International Conference on Extending Database Technology
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

The Journal of Machine Learning Research
On nonmetric similarity search problems in complex domains

ACM Computing Surveys (CSUR)
Fast moment estimation in data streams in optimal space

Proceedings of the forty-third annual ACM symposium on Theory of computing
Electrostatic field framework for supervised and semi-supervised learning from incomplete data

Natural Computing: an international journal
Enhancing grid-density based clustering for high dimensional data

Journal of Systems and Software
Combining instance selection methods based on data characterization: An approach to increase their effectiveness

Information Sciences: an International Journal
The role of hubness in clustering high-dimensional data

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Assessing the efficiency of health care providers: a SOM perspective

WSOM'11 Proceedings of the 8th international conference on Advances in self-organizing maps
Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification

MLDM'11 Proceedings of the 7th international conference on Machine learning and data mining in pattern recognition
Quality of similarity rankings in time series

SSTD'11 Proceedings of the 12th international conference on Advances in spatial and temporal databases
A modified apriori algorithm for analysing high-dimensional gene data

IDEAL'11 Proceedings of the 12th international conference on Intelligent data engineering and automated learning
Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering

Pattern Recognition
Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization

Information Sciences: an International Journal
Non-parametric detection of meaningless distances in high dimensional data

Statistics and Computing
Trading precision for speed: localised similarity functions

CIVR'05 Proceedings of the 4th international conference on Image and Video Retrieval
Adapting k-means algorithm for discovering clusters in subspaces

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
On fast non-metric similarity search by metric access methods

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Data mining from a patient safety database: the lessons learned

Data Mining and Knowledge Discovery
ESPClust: an effective skew prevention method for model-based document clustering

CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Fractional distance measures for content-based image retrieval

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
MMPClust: a skew prevention algorithm for model-based document clustering

DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
The curse of dimensionality in data mining and time series prediction

IWANN'05 Proceedings of the 8th international conference on Artificial Neural Networks: computational Intelligence and Bioinspired Systems
Interactions between document representation and feature selection in text categorization

DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Analogy-based reasoning in classifier construction

Transactions on Rough Sets IV
On finding the natural number of topics with latent dirichlet allocation: some observations

PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Ranking invariance based on similarity measures in document retrieval

AMR'05 Proceedings of the Third international conference on Adaptive Multimedia Retrieval: user, context, and feedback
Measuring the complexity of a collection of documents

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Revisiting centrality-as-relevance: support sets and similarity as geometric proximity

Journal of Artificial Intelligence Research
Applying instance-based techniques to prediction of final outcome in acute stroke

Artificial Intelligence in Medicine
Hubness-Aware shared neighbor distances for high-dimensional k-nearest neighbor classification

HAIS'12 Proceedings of the 7th international conference on Hybrid Artificial Intelligent Systems - Volume Part II
Subspace clustering

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Objective function-based clustering

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Clustering high dimensional data

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
The small sample size problem of ICA: A comparative study and analysis

Pattern Recognition
Perceptual indiscernibility, rough sets, descriptively near sets, and image analysis

Transactions on Rough Sets XV
Center-Based Indexing in Vector and Metric Spaces

Fundamenta Informaticae
Volume visualization and visual queries for large high-dimensional datasets

VISSYM'04 Proceedings of the Sixth Joint Eurographics - IEEE TCVG conference on Visualization
A survey on unsupervised outlier detection in high-dimensional numerical data

Statistical Analysis and Data Mining
Parsimonious Mahalanobis kernel for the classification of high dimensional data

Pattern Recognition
The bitvector machine: a fast and robust machine learning algorithm for non-linear problems

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Learning a ground object manifold for interpreting high-resolution sensor image

AICI'12 Proceedings of the 4th international conference on Artificial Intelligence and Computational Intelligence
Hybrid negative selection approach for anomaly detection

CISIM'12 Proceedings of the 11th IFIP TC 8 international conference on Computer Information Systems and Industrial Management
Assisted descriptor selection based on visual comparative data analysis

EuroVis'11 Proceedings of the 13th Eurographics / IEEE - VGTC conference on Visualization
A survey on enhanced subspace clustering

Data Mining and Knowledge Discovery
On the equivalence of PLSI and projected clustering

ACM SIGMOD Record
Dimensionality Reduction with Unsupervised Feature Selection and Applying Non-Euclidean Norms for Classification Accuracy

International Journal of Data Warehousing and Mining
Multimedia information retrieval in a social context

PROMISE'12 Proceedings of the 2012 international conference on Information Retrieval Meets Information Visualization
Training data selection for cross-project defect prediction

Proceedings of the 9th International Conference on Predictive Models in Software Engineering
Local and global scaling reduce hubs in space

The Journal of Machine Learning Research
Classification and outlier detection based on topic based pattern synthesis

MLDM'13 Proceedings of the 9th international conference on Machine Learning and Data Mining in Pattern Recognition
Black box scheduling for resource intensive virtual machine workloads with interference models

Future Generation Computer Systems
Class imbalance and the curse of minority hubs

Knowledge-Based Systems
Revisiting centrality-as-relevance: support sets and similarity as geometric proximity: extended abstract

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Research issues in outlier detection for data streams

ACM SIGKDD Explorations Newsletter

Abstract

In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance, or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used Lk norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. This means, for example, that the Manhattan distance metric (L1 norm) is consistently preferable to the Euclidean distance metric (L2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the Lk norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results from both the theoretical and the empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
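
As a rough illustration of the distance functions the abstract refers to, the sketch below computes an Lk-style distance for k = 2 (Euclidean), k = 1 (Manhattan), and a fractional k = 0.5. It assumes that the fractional extension simply allows 0 < k < 1 in the usual formula (sum of |x_i - y_i|^k, raised to the power 1/k); for such k the function no longer satisfies the triangle inequality, so it is not a metric in the strict sense. The names `minkowski_distance`, `relative_contrast`, and `demo`, the uniform random data, and the (Dmax - Dmin) / Dmin contrast measure are choices made for this example and are not taken from the abstract.

```python
import random


def minkowski_distance(x, y, k):
    """Lk-style distance: (sum_i |x_i - y_i|**k) ** (1/k).

    k = 2 gives the Euclidean distance, k = 1 the Manhattan distance.
    Allowing 0 < k < 1 is the assumed "fractional" variant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def relative_contrast(points, query, k):
    """(Dmax - Dmin) / Dmin over the distances from query to each point.

    This contrast measure is an assumption of the example: larger values
    mean the nearest and farthest points are easier to tell apart.
    """
    dists = [minkowski_distance(p, query, k) for p in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def demo(dim=50, n_points=200, seed=0):
    """Contrast for several k on uniform random data (hypothetical setup)."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [0.0] * dim  # the origin serves as the query point
    return {k: relative_contrast(points, query, k) for k in (0.5, 1.0, 2.0)}


if __name__ == "__main__":
    for k, contrast in demo().items():
        print(f"k={k}: relative contrast = {contrast:.3f}")
```

Running the script prints one contrast value per k for a single random sample. Any ordering observed across k on such a sample is only suggestive; the paper's theoretical and empirical analysis is what supports the claim made in the abstract.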