A near-optimal similarity join algorithm and performance evaluation

  • Authors:
  • Zhengwu Yang;Guoqiang Yang

  • Affiliations:
  • Institute of Pattern Recognition and Intelligent Systems, Shanghai Jiao Tong University, Shanghai 200052, PR China;Department of Mathematics, Duke University, P.O. Box 90320, Durham, NC

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal
  • Year:
  • 2004

Abstract

Similarity join, a basic operation for multimedia databases, amounts to finding all pairs of points whose distance is bounded by a given parameter ε. In this paper, properties of index-based join algorithms are studied and a highly efficient, near-optimal similarity join algorithm is proposed. Our algorithm uses a breadth-first strategy and lets the cache content guide the join computation and I/O access. In contrast with many other proposed join algorithms, our algorithm is essentially independent of the ordering strategy and requires only minimal cache capacity. As a result, a more precise plan for the sequence of join computations and I/O accesses can be realized: in general, each page is accessed and processed only once. Qualitative and quantitative analyses of the algorithm's performance are provided. Although only the R-tree-based similarity join is discussed in this paper (the R-tree being a common index structure), the idea can be generalized to other join algorithms without substantial difficulty. Experiments based on our analysis indicate that the new algorithm yields superior performance across a wide range of dimensions and database sizes.
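To make the definition of the operation concrete, the following is a minimal brute-force sketch of an ε-similarity join between two point sets. It is not the index-based, breadth-first algorithm the paper proposes; it only illustrates what result any such algorithm must produce. The function name `similarity_join`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(r, s, eps):
    """Return all index pairs (i, j) with distance(r[i], s[j]) <= eps.

    Brute-force reference definition: it compares every pair, so it costs
    O(len(r) * len(s)) distance computations. Index-based joins exist to
    avoid exactly this exhaustive comparison.
    """
    return [
        (i, j)
        for i, p in enumerate(r)
        for j, q in enumerate(s)
        if dist(p, q) <= eps
    ]


# Hypothetical sample data: two small sets of 2-D points.
r = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
s = [(0.1, 0.0), (1.0, 1.2), (9.0, 9.0)]
pairs = similarity_join(r, s, eps=0.5)
print(pairs)
```

A brute-force version like this is mainly useful as a correctness baseline when checking the output of a faster, index-based implementation.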