Distance concentration is the phenomenon that, under certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high-dimensional data processing, analysis, retrieval, and indexing, all of which rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data are finite-dimensional, so the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite-dimensional case, in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function under an unknown data distribution solely on the basis of an available data sample drawn from it. This makes it possible to test for and detect problematic cases more rigorously than is currently possible, and we demonstrate the approach on both synthetic data and ten real-world data sets from different domains.
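To illustrate the phenomenon itself (this is a generic empirical demonstration, not the paper's bounding method), one can measure the relative contrast (Dmax − Dmin)/Dmin of Euclidean distances from a random query to a sample of random points, and watch it shrink as the dimensionality grows. The function name and the uniform-hypercube data below are illustrative assumptions:

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """Relative contrast of Euclidean distances from a random query
    to n_points uniform samples in the dim-dimensional unit hypercube.
    A value near 0 means nearest and farthest neighbours are nearly
    indistinguishable, i.e. distances have concentrated."""
    X = rng.random((n_points, dim))          # data sample
    q = rng.random(dim)                      # query point
    d = np.linalg.norm(X - q, axis=1)        # distances to the query
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
for dim in (2, 20, 200, 2000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(500, dim, rng):.3f}")
```

Running this shows the contrast dropping by orders of magnitude from low to high dimensions, which is exactly the effect that a sample-based concentration test aims to detect in a given data set.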