High dimensional nearest neighbor searching

  • Authors:
  • Hakan Ferhatosmanoglu; Ertem Tuncel; Divyakant Agrawal; Amr El Abbadi

  • Affiliations:
  • Computer Science and Engineering, Ohio State University, Columbus, OH; Electrical Engineering, University of California, Riverside; Computer Science, University of California, Santa Barbara; Computer Science, University of California, Santa Barbara

  • Venue:
  • Information Systems
  • Year:
  • 2006

Abstract

As databases increasingly integrate different types of information such as time-series, multimedia, and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. Because of this scale and high dimensionality, traditional techniques have proven inadequate. In this paper, we propose search techniques that are especially effective for large, high dimensional data sets. We first propose the VA+-file technique, which is based on scalar quantization of the data. The VA+-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximation in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluating such techniques. Finally, we propose a new clustering-based technique that merges the benefits of various approaches to progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation shows that the proposed techniques outperform existing techniques for high dimensional similarity searching. The proposed techniques are effective for real-life data sets, which are typically non-uniform, and they scale with respect to both the dimensionality and the size of the data set.
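
To illustrate the general idea of exact NN search over scalar-quantized data, the following is a minimal Python sketch of a generic filter-and-refine search. It is not the VA+-file itself, whose details the abstract does not give. The uniform per-dimension grid, the `bits` parameter, and all function names (`build_approximations`, `lower_bound`, `nearest_neighbor`) are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_approximations(points, bits=3):
    """Quantize each coordinate into 2**bits uniform cells (assumed scheme)."""
    dims = len(points[0])
    levels = 1 << bits
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # Fall back to width 1.0 when a dimension has zero extent.
    widths = [((highs[d] - lows[d]) / levels) or 1.0 for d in range(dims)]
    cells = [
        tuple(
            min(levels - 1, int((p[d] - lows[d]) / widths[d]))
            for d in range(dims)
        )
        for p in points
    ]
    return cells, lows, widths


def lower_bound(query, cell, lows, widths):
    """Smallest possible distance from the query to any point in the cell."""
    total = 0.0
    for d, c in enumerate(cell):
        lo = lows[d] + c * widths[d]
        hi = lo + widths[d]
        if query[d] < lo:
            total += (lo - query[d]) ** 2
        elif query[d] > hi:
            total += (query[d] - hi) ** 2
    return math.sqrt(total)


def nearest_neighbor(query, points, cells, lows, widths):
    """Filter on the compact approximations, then refine with exact distances."""
    # Filter step: scan only the approximations and rank by lower bound.
    bounds = sorted(
        (lower_bound(query, cell, lows, widths), i)
        for i, cell in enumerate(cells)
    )
    best_dist, best_idx, visited = math.inf, -1, 0
    # Refine step: read full vectors in order of increasing lower bound.
    for lb, i in bounds:
        if lb >= best_dist:
            break  # No remaining candidate can beat the current best.
        visited += 1
        dist = euclidean(query, points[i])
        if dist < best_dist:
            best_dist, best_idx = dist, i
    return best_idx, best_dist, visited
```

Calling `nearest_neighbor(query, points, *build_approximations(points))` returns the index of the nearest point, its distance, and the number of full vectors that had to be read. The last value is the quantity such a filter is meant to keep small relative to the data set size.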