When is 'nearest neighbour' meaningful: A converse theorem and implications

  • Authors:
  • Robert J. Durrant; Ata Kabán

  • Affiliations:
  • School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK (both authors)

  • Venue:
  • Journal of Complexity
  • Year:
  • 2009

Abstract

Beyer et al. gave a sufficient condition for the high-dimensional phenomenon known as the concentration of distances. Their work pinpointed serious problems that arise when nearest neighbours cease to be meaningful in high dimensions. Here we establish the converse of their result, in order to answer the question of when nearest neighbour remains meaningful in arbitrarily high dimensions. We then show, for a class of realistic data distributions with non-i.i.d. dimensions, namely the family of linear latent variable models, that the Euclidean distance will not concentrate as long as the number of 'relevant' dimensions grows no slower than the overall data dimensionality. This condition is, of course, often not met in practice. After numerically validating our findings, we examine real data situations in two different areas (text-based document collections and gene expression arrays), which suggest that the presence or absence of distance concentration in high-dimensional problems plays a role in making the data hard or easy to work with.
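
As a quick illustration of the distinction the abstract draws, below is a minimal NumPy sketch (not the authors' experimental code). It uses Beyer et al.'s relative contrast (Dmax - Dmin)/Dmin as the measure of how meaningful a nearest-neighbour query is; the toy generator `latent_data`, the parameter choices, and the noise level are illustrative assumptions. It compares three regimes as the data dimensionality d grows: i.i.d. dimensions, a linear latent variable model where the number m of 'relevant' dimensions stays fixed, and one where m grows in step with d.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(X, q):
    """Relative contrast (Dmax - Dmin) / Dmin of the Euclidean distances
    from a query point q to the rows of X; values near 0 indicate that
    distances have concentrated and nearest neighbour is not meaningful."""
    dists = np.linalg.norm(X - q, axis=1)
    return (dists.max() - dists.min()) / dists.min()

def latent_data(n, d, m, k, noise=1.0):
    """n points in d dimensions from a toy linear latent variable model:
    the first m 'relevant' dimensions are linear mixtures of k latent
    factors; the remaining d - m dimensions are pure i.i.d. noise."""
    Lam = rng.normal(size=(m, k))        # loading matrix of the relevant dims
    Y = rng.normal(size=(n, k))          # latent factors
    X = noise * rng.normal(size=(n, d))  # i.i.d. noise in every dimension
    X[:, :m] += Y @ Lam.T                # add the latent signal
    return X

n, k = 1000, 5
print(f"{'d':>6}  {'i.i.d.':>8}  {'m fixed':>8}  {'m = d':>8}")
for d in (10, 100, 1000, 10000):
    # i.i.d. dimensions: Beyer et al.'s setting, distances concentrate
    X = rng.uniform(size=(n, d))
    rc_iid = relative_contrast(X, rng.uniform(size=d))

    # few relevant dimensions (m fixed while d grows): still concentrates
    X = latent_data(n + 1, d, m=min(10, d), k=k)
    rc_fixed = relative_contrast(X[:-1], X[-1])

    # relevant dimensions keeping pace with d (m = d): no concentration
    X = latent_data(n + 1, d, m=d, k=k)
    rc_grow = relative_contrast(X[:-1], X[-1])

    print(f"{d:>6}  {rc_iid:8.3f}  {rc_fixed:8.3f}  {rc_grow:8.3f}")
```

Under this setup one would expect the first two columns to shrink towards zero as d grows, while the last stays roughly constant, mirroring the paper's claim that Euclidean distances fail to concentrate only when the relevant dimensions grow no slower than the overall dimensionality.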