What's wrong with high-dimensional similarity search?

Authors:
Stephen Blott;Roger Weber
Affiliations:
Dublin City University, Dublin, Ireland;Credit Suisse, Zurich, Switzerland
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Citing 1
Cited 1

A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases

Scalable kNN search on vertically stored time series

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Abstract

Similarity search in high-dimensional vector spaces has been the subject of substantial research, motivated in part by the need to provide query support for images and other complex data types. The VLDB 1998 paper "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" analyses why this search problem can be so tricky, and shows with intuitive yet formal proofs that nearest-neighbour search is fundamentally linear beyond a certain dimensionality. Consequently, that paper proposes a new, linear search structure (the VA-File), which focuses on accelerating the indispensable sequential scan with approximations and computational schemes that reduce both CPU and I/O effort. Experiments with both synthetic and image data showed, surprisingly at the time, that such schemes outperform hierarchical methods in all cases where the dimensionality is greater than five. In this paper, we review that work and identify both what we got right in the paper and its impact, and also (with the benefit of hindsight) those elements of the work for which we were off the mark. The lessons learned are relevant not just to the narrow area of similarity search, but also more broadly across the fields of databases and computing.
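
The idea of accelerating a sequential scan with approximations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the uniform grid over coordinates assumed to lie in [0, 1), and the choice of 4 bits per dimension are all assumptions made for this example. The scan first computes a distance lower bound from each compact approximation, then reads full vectors in order of increasing lower bound and stops as soon as no remaining vector can beat the best exact distance found so far.

```python
import heapq
import math
import random


def build_approximations(vectors, bits=4):
    """Quantise each coordinate (assumed to lie in [0, 1)) into 2**bits cells."""
    cells = 1 << bits
    return [[min(int(x * cells), cells - 1) for x in v] for v in vectors]


def lower_bound(query, approx, bits=4):
    """Smallest possible Euclidean distance from query to any point in the cell."""
    cells = 1 << bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
    return math.sqrt(total)


def nearest_neighbour(query, vectors, approximations, bits=4):
    """Filter-and-refine scan; returns (index, distance, exact distances computed)."""
    # Phase 1: scan only the compact approximations and compute lower bounds.
    bounds = [(lower_bound(query, a, bits), i) for i, a in enumerate(approximations)]
    heapq.heapify(bounds)
    best_i, best_d, visited = -1, math.inf, 0
    # Phase 2: visit full vectors in order of increasing lower bound.
    while bounds:
        lb, i = heapq.heappop(bounds)
        if lb >= best_d:
            break  # no remaining vector can be closer than the current best
        d = math.dist(query, vectors[i])
        visited += 1
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited


if __name__ == "__main__":
    random.seed(0)
    data = [[random.random() for _ in range(16)] for _ in range(500)]
    approx = build_approximations(data)
    q = [random.random() for _ in range(16)]
    index, dist, visited = nearest_neighbour(q, data, approx)
    print(f"nearest index={index}, distance={dist:.4f}, exact distances computed={visited}")
```

In this sketch every approximation is still scanned, so the work stays linear in the number of vectors. The potential saving comes from the approximations being much smaller than the full vectors and from computing exact distances only for the candidates that survive the lower-bound filter.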