A near-optimal similarity join algorithm and performance evaluation

Authors:
Zhengwu Yang; Guoqiang Yang
Affiliations:
Institute of Pattern Recognition and Intelligent Systems, Shanghai Jiao Tong University, Shanghai 200052, PR China; Department of Mathematics, Duke University, P.O. Box 90320, Durham, NC
Venue:
Information Sciences—Informatics and Computer Science: An International Journal
Year:
2004

Abstract

Similarity join, a basic operation in multimedia databases, finds all pairs of points whose distance is bounded by a given parameter ε. In this paper, properties of index-based join algorithms are studied and a highly efficient, near-optimal similarity join algorithm is proposed. Our algorithm uses a breadth-first strategy and guides the join computation and I/O access by the cache content. In contrast with many other proposed join algorithms, it is essentially independent of the ordering strategy and requires only a minimal cache capacity. As a result, the sequence of join computations and I/O accesses can be planned more precisely; in general, each page needs to be accessed and processed only once. Qualitative and quantitative analyses of the algorithm's performance are provided. Although only similarity join based on the R-tree (a common index structure) is discussed in this paper, the idea can be generalized to other join algorithms without substantial difficulty. Experiments based on our analysis indicate that the new algorithm yields superior performance across a wide range of dimensions and database sizes.
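
To make the operation concrete, the following is a minimal sketch of the similarity join as the abstract defines it: a brute-force self-join that returns every pair of points within distance ε. It is not the index-based, cache-guided algorithm proposed in the paper. The function name `similarity_self_join`, the choice of Euclidean distance, and the sample points are assumptions made for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, with distance(points[i], points[j]) <= eps.

    Brute-force baseline: compares every pair, so it costs O(n^2) distance
    computations. Index-based joins (e.g. over an R-tree) aim to avoid most
    of these comparisons; this sketch only illustrates the result they compute.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= eps
    ]


# Hypothetical sample data: two tight clusters in the plane.
points = [(0.0, 0.0), (0.3, 0.4), (2.0, 2.0), (2.1, 2.0)]
pairs = similarity_self_join(points, eps=0.6)
print(pairs)
```

With these sample points, only the two points inside each cluster are close enough to be joined. A smaller ε shrinks the result set, which is why ε is the main parameter governing the cost of any similarity join.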