Nearest neighbors in high-dimensional data: the emergence and influence of hubs

  • Authors: Miloš Radovanović; Alexandros Nanopoulos; Mirjana Ivanović

  • Affiliations: University of Novi Sad, Novi Sad, Serbia; University of Hildesheim, Hildesheim, Germany; University of Novi Sad, Novi Sad, Serbia

  • Venue: ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
  • Year: 2009

Abstract

High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
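
The k-occurrence count defined in the abstract can be illustrated with a short NumPy sketch. This is not the authors' code: the function names (k_occurrences, skewness, demo), the use of Euclidean distance, the i.i.d. Gaussian test data, and the parameter values are all assumptions chosen for illustration. The sketch counts how often each point appears in the k-nearest-neighbor lists of the other points, then summarizes the asymmetry of that distribution with the standardized third moment.

```python
import numpy as np

def k_occurrences(X, k):
    """For each point, count how many other points have it among their k nearest neighbors.

    Assumes Euclidean distance and k < number of points (illustrative choices).
    """
    n = X.shape[0]
    # Squared Euclidean distances computed from the Gram matrix.
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # A point is never counted as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k smallest distances in each row (order within the k is arbitrary).
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=n)

def skewness(values):
    """Standardized third moment; positive values indicate a long right tail."""
    v = np.asarray(values, dtype=float)
    c = v - v.mean()
    return float(np.mean(c ** 3) / np.mean(c ** 2) ** 1.5)

def demo(n=1000, k=5, dims=(3, 100), seed=0):
    """Return {dimension: (skewness, max k-occurrence)} for synthetic Gaussian data."""
    rng = np.random.default_rng(seed)
    results = {}
    for d in dims:
        X = rng.standard_normal((n, d))
        counts = k_occurrences(X, k)
        results[d] = (skewness(counts), int(counts.max()))
    return results

if __name__ == "__main__":
    for d, (skew, largest) in demo().items():
        print(f"d={d}: skewness={skew:.2f}, max k-occurrence={largest}")
```

Every point contributes exactly k neighbor slots, so the counts always sum to n·k and their mean is k regardless of dimensionality. Under the assumptions above, what should change with dimensionality is how unevenly that fixed total is shared: a larger skewness and a larger maximum count correspond to the hub points the abstract describes.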