The Concentration of Fractional Distances

Authors:
Damien Francois;Vincent Wertz;Michel Verleysen
Affiliations:
-;IEEE;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Abstract

Nearest neighbor search and many other numerical data analysis tools most often rely on the Euclidean distance. When data are high dimensional, however, Euclidean distances seem to concentrate: all distances between pairs of data elements seem to be very similar. The relevance of the Euclidean distance has therefore been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample. Furthermore, an estimate of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally assumed, fractional norms are not always less concentrated than the Euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular the assumption of independent and identically distributed variables. Finally, some insights into how to choose an optimal metric are given.
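The concentration phenomenon described in the abstract can be observed with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes data drawn independently and uniformly from the unit hypercube, and it measures concentration as the ratio of the standard deviation of the norm to its mean, which is one possible choice of measure. The function names `minkowski_norm` and `relative_spread`, the sample size, and the seed are all hypothetical choices made for this example.

```python
import math
import random
import statistics


def minkowski_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p); for p < 1 this is a fractional 'norm'."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def relative_spread(dim, p, n_points=2000, seed=0):
    """Estimate std(||X||_p) / mean(||X||_p) for X uniform on [0, 1]^dim.

    Assumption for this sketch: coordinates are i.i.d. uniform, which is
    only one of many possible data distributions.
    """
    rng = random.Random(seed)
    norms = [
        minkowski_norm([rng.random() for _ in range(dim)], p)
        for _ in range(n_points)
    ]
    return statistics.pstdev(norms) / statistics.fmean(norms)


if __name__ == "__main__":
    # A smaller ratio means the norms are more tightly clustered
    # around their mean, i.e. more concentrated.
    for p in (0.5, 1.0, 2.0):
        row = ", ".join(
            f"d={d}: {relative_spread(d, p):.3f}" for d in (2, 20, 200)
        )
        print(f"p={p}: {row}")
```

Under these assumptions, the ratio shrinks as the dimension grows for every exponent tried, which is the concentration effect. How the exponent affects the ratio depends on the data distribution, so a comparison obtained on uniform data should not be read as a general ranking of fractional versus Euclidean norms.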