What's wrong with high-dimensional similarity search?

  • Authors:
  • Stephen Blott;Roger Weber

  • Affiliations:
  • Dublin City University, Dublin, Ireland;Credit Suisse, Zurich, Switzerland

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2008

Abstract

Similarity search in high-dimensional vector spaces has been the subject of substantial research, motivated in part by the need to provide query support for images and other complex data types. The VLDB 1998 paper "Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" analyses why this search problem can be so tricky, and shows with intuitive yet formal proofs that nearest-neighbour search is fundamentally linear beyond a certain dimensionality. Consequently, the paper proposes a new, linear search structure (the VA-File), which focuses on accelerating the indispensable sequential scan with approximations and computational schemes that reduce both CPU and IO effort. Experiments with both synthetic and image data showed, surprisingly at the time, that such schemes outperform hierarchical methods in all cases where the dimensionality is greater than five. In this paper, we review that work and identify both what we got right in the paper and its impact, and also (with the benefit of hindsight) those elements of the work for which we were off the mark. The lessons learned are relevant not just to the narrow area of similarity search, but also more broadly across the fields of databases and computing.