Processing distance-based queries in multidimensional data spaces using R-trees

  • Authors:
  • Antonio Corral;Joaquin Cañadas;Michael Vassilakopoulos

  • Affiliations:
  • Department of Languages and Computation, University of Almeria, Almeria, Spain;Department of Languages and Computation, University of Almeria, Almeria, Spain;Department of Information Technology, Technological Educational Institute of Thessaloniki, Greece

  • Venue:
  • PCI'01 Proceedings of the 8th Panhellenic conference on Informatics
  • Year:
  • 2001

Abstract

In modern database applications, the similarity or dissimilarity of data objects is examined by performing distance-based queries (DBQs) on multidimensional data. The R-tree and its variations are commonly cited multidimensional access methods. In this paper, we investigate the performance of the most representative distance-based queries in multidimensional data spaces, where the point datasets are indexed by tree-like structures belonging to the R-tree family. To perform the K-nearest neighbor query (K-NNQ) and the K-closest pair query (K-CPQ), non-incremental recursive branch-and-bound algorithms are employed. The K-CPQ is shown to be a very expensive query for datasets of high cardinality, and it becomes even more costly as the dimensionality increases. We also give ɛ-approximate versions of the DBQ algorithms that run faster than the exact ones, at the expense of introducing a relative distance error in the result. Experimentation with synthetic multidimensional point datasets, following Uniform and Gaussian distributions, reveals that the best index structure for K-NNQ is the X-tree. However, for K-CPQ, the R*-tree outperforms the X-tree with respect to the response time and the number of disk accesses when an LRU buffer is used. Moreover, applying the ɛ-approximate technique to the recursive K-CPQ algorithm quickly yields acceptable approximations of the result, although users cannot easily control the tradeoff between cost and accuracy.
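
To make the idea of a recursive branch-and-bound K-NNQ with an ɛ-approximate variant concrete, the following is a minimal Python sketch. It is not the authors' algorithm and does not use an R-tree, R*-tree, or X-tree: the index is a hypothetical in-memory tree of bounding boxes built by median splits, and the names `Node`, `build`, `mindist`, `knn`, and `k_nearest` are illustrative assumptions. The pruning rule shown (skip a subtree when its minimum distance times 1 + ɛ is not below the current K-th best distance) is one common way to define ɛ-approximation and may differ from the definition used in the paper. With `eps=0` the search is exact.

```python
import heapq
import math


class Node:
    """Hypothetical tree node: a leaf holds points, an inner node holds children."""

    def __init__(self, points=None, children=None):
        self.points = points or []
        self.children = children or []
        # Bounding box from the points (leaf) or from the child box corners (inner).
        corners = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        dims = range(len(corners[0]))
        self.lo = tuple(min(p[d] for p in corners) for d in dims)
        self.hi = tuple(max(p[d] for p in corners) for d in dims)


def build(points, leaf_size=4, depth=0):
    """Build a simple bounding-box hierarchy by median splits (assumes points is non-empty)."""
    if len(points) <= leaf_size:
        return Node(points=list(points))
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(children=[build(pts[:mid], leaf_size, depth + 1),
                          build(pts[mid:], leaf_size, depth + 1)])


def mindist(q, node):
    """Smallest possible Euclidean distance from q to any point inside the node's box."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(q, node.lo, node.hi)))


def knn(node, q, k, eps=0.0, best=None):
    """Recursive branch-and-bound search; best is a max-heap of (-distance, point)."""
    if best is None:
        best = []
    if node.points:
        for p in node.points:
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    else:
        # Visit the most promising child first so the bound tightens early.
        for child in sorted(node.children, key=lambda c: mindist(q, c)):
            bound = -best[0][0] if len(best) == k else math.inf
            # eps = 0 gives exact pruning; eps > 0 prunes more aggressively.
            if mindist(q, child) * (1.0 + eps) < bound:
                knn(child, q, k, eps, best)
    return best


def k_nearest(root, q, k, eps=0.0):
    """Return the K results as (distance, point) pairs, nearest first."""
    return sorted((-neg_d, p) for neg_d, p in knn(root, q, k, eps))
```

Under this pruning rule, the K-th distance returned is at most (1 + ɛ) times the true K-th nearest distance, so a larger `eps` trades accuracy for fewer visited subtrees. A call such as `k_nearest(build(points), query, 5, eps=0.5)` would return five (distance, point) pairs for a list of equal-length coordinate tuples `points`.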