Beyer et al. gave a sufficient condition for the high-dimensional phenomenon known as the concentration of distances. Their work pinpointed serious problems arising when nearest neighbours cease to be meaningful in high dimensions. Here we establish the converse of their result, in order to answer the question of when nearest neighbour remains meaningful in arbitrarily high dimensions. We then show, for a class of realistic data distributions with non-i.i.d. dimensions, namely the family of linear latent variable models, that the Euclidean distance does not concentrate as long as the number of 'relevant' dimensions grows no slower than the overall data dimension. This condition is, of course, often not met in practice. After numerically validating our findings, we examine real data situations in two different areas (text-based document collections and gene expression arrays), which suggest that the presence or absence of distance concentration in high-dimensional problems plays a role in making the data hard or easy to work with.
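The contrast between the two regimes described above can be sketched numerically. The snippet below is a minimal illustration (not the paper's own experiment): it measures Beyer et al.'s relative contrast, (Dmax − Dmin)/Dmin, of Euclidean distances as the dimension grows, once for data with i.i.d. dimensions and once for a linear latent variable model in which a few latent factors are mixed into all observed coordinates. The sample size `n = 500` and latent dimension `k = 5` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(X):
    """Beyer et al.'s relative contrast (Dmax - Dmin) / Dmin of the
    Euclidean distances from the origin to the rows of X."""
    d = np.linalg.norm(X, axis=1)
    return (d.max() - d.min()) / d.min()

n, k = 500, 5  # sample size and latent dimension (illustrative values)
contrasts = {}
for dim in (10, 100, 1000, 10000):
    # (a) i.i.d. dimensions: the classical concentration setting
    iid = rng.standard_normal((n, dim))
    # (b) linear latent variable model X = Z A + E: few latent factors,
    #     but mixed into *all* observed coordinates, so the number of
    #     'relevant' dimensions grows with the overall dimension
    A = rng.standard_normal((k, dim))
    latent = rng.standard_normal((n, k)) @ A + rng.standard_normal((n, dim))
    contrasts[dim] = (relative_contrast(iid), relative_contrast(latent))
    print(dim, contrasts[dim])
```

Under this sketch, the i.i.d. contrast shrinks towards zero as the dimension grows (distances concentrate, so the nearest neighbour becomes unstable), while the latent-model contrast stays bounded away from zero.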