High-Dimensional Similarity Searches Using A Metric Pseudo-Grid

  • Authors: Christian Digout; Mario A. Nascimento
  • Affiliations: Univ. of Alberta, Canada; Univ. of Alberta, Canada
  • Venue: ICDEW '05 Proceedings of the 21st International Conference on Data Engineering Workshops
  • Year: 2005

Abstract

Despite the proposal of numerous tree-based access structures for high-dimensional similarity searches, techniques based on a sequential scan have been shown to be simple yet quite efficient alternatives. Given that random accesses to disk are expensive, a linear scan of the (smaller) pre-processed dataset is often much more efficient than even the relatively small number of random disk accesses incurred by tree-based indices. In this paper we present a technique that uses a pseudo-partition of a general metric space, analogous to the VA-File's partition of the vector space. The rationale is to use a number of pivot objects in the metric space, each one determining a number of hyper-rings in this space. The intersections of those rings determine pseudo-cells, analogous to the VA-File's cells in the vector space. To speed up query processing, the data set is clustered (using any applicable clustering technique). Clusters that do not intersect any cell intersected by the query region cannot contribute to the answer set. Thus, only a few clusters are searched, each using an I/O-efficient linear scan of the cluster's data. The proposed technique, which we call the M-GRID, is, by construction, applicable both to general metric spaces and to traditional vector spaces, as long as a metric distance function is used. The M-GRID is robust to several parameters, and experiments with synthetic and real data sets show that it is able to perform nearest neighbor queries up to 10 times faster than the VA-File.
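
The pivot, hyper-ring, and cluster-pruning idea described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (ring_index, cell_of, summarize, query_ring_ranges, range_search) are hypothetical, all pivots are assumed to share one list of ring radii, each cluster is assumed to be summarized by the set of pseudo-cells its members occupy, and a fixed-radius range query stands in for the nearest neighbor queries evaluated in the paper. The pruning step relies only on the triangle inequality, so any metric distance function could replace the Euclidean one used here.

```python
from bisect import bisect_right
from math import dist  # Euclidean distance; any metric distance could be swapped in


def ring_index(radii, d):
    """Index of the hyper-ring (around one pivot) that contains distance d."""
    return bisect_right(radii, d)


def cell_of(obj, pivots, radii, metric=dist):
    """Pseudo-cell signature of an object: one ring index per pivot."""
    return tuple(ring_index(radii, metric(obj, p)) for p in pivots)


def summarize(groups, pivots, radii, metric=dist):
    """Pair each pre-computed cluster with the set of pseudo-cells it occupies.

    Assumption: the clustering itself (groups) is produced elsewhere.
    """
    return [({cell_of(o, pivots, radii, metric) for o in g}, list(g))
            for g in groups]


def query_ring_ranges(q, r, pivots, radii, metric=dist):
    """Per pivot, the inclusive range of ring indices the ball (q, r) can touch.

    By the triangle inequality, any o with d(q, o) <= r satisfies
    d(q, p) - r <= d(o, p) <= d(q, p) + r for every pivot p.
    """
    ranges = []
    for p in pivots:
        dqp = metric(q, p)
        lo = ring_index(radii, max(0.0, dqp - r))
        hi = ring_index(radii, dqp + r)
        ranges.append((lo, hi))
    return ranges


def range_search(q, r, clusters, pivots, radii, metric=dist):
    """Return (objects within r of q, number of clusters actually scanned)."""
    ranges = query_ring_ranges(q, r, pivots, radii, metric)
    hits, scanned = [], 0
    for cells, members in clusters:
        # A cluster survives only if one of its cells lies inside the
        # query's ring range for every pivot.
        touched = any(
            all(lo <= c <= hi for c, (lo, hi) in zip(cell, ranges))
            for cell in cells
        )
        if not touched:
            continue  # pruned: no member can be within r of q
        scanned += 1
        # Linear scan of the surviving cluster's data.
        hits.extend(o for o in members if metric(q, o) <= r)
    return hits, scanned


if __name__ == "__main__":
    pivots = [(0.0, 0.0), (10.0, 0.0)]   # hypothetical pivot objects
    radii = [2.0, 4.0, 6.0, 8.0]         # hypothetical ring boundaries
    groups = [[(1.0, 1.0), (1.5, 0.5)], [(9.0, 1.0), (8.5, 0.5)]]
    clusters = summarize(groups, pivots, radii)
    print(range_search((1.0, 0.5), 1.0, clusters, pivots, radii))
```

In this toy example the query ball around (1.0, 0.5) touches only pseudo-cells occupied by the first cluster, so the second cluster is skipped without computing any distance to its members. How the rings are placed, how clusters are represented on disk, and how nearest neighbor queries are processed are design decisions of the paper that this sketch does not attempt to reproduce.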