High-Dimensional Similarity Joins

Authors:
K. Shim;R. Srikant;R. Agrawal
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2002

Abstract

Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present an algorithm that uses a new index structure, called the $\epsilon$ tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces both the number of neighboring leaf nodes considered for the join test and the traversal cost of finding the appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions, so the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the $\epsilon$ tree and the R-tree family, and show that the $\epsilon$ tree performs better for high-dimensional joins. Empirical evaluation on synthetic and real-life data sets shows that a similarity join using the $\epsilon$ tree is between two times and an order of magnitude faster than one using the $R^+$ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas behind the $\epsilon$ tree can be applied to the R-tree family. The resulting biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the $\epsilon$ tree.
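
To make the operation concrete, the following is a minimal sketch of a similarity join, not the $\epsilon$ tree itself. It assumes Euclidean distance and a distance threshold named `eps`; the function names `brute_force_join` and `stripe_join` are hypothetical and do not come from the paper. The second function only hints at the idea of limiting which neighboring partitions are tested: it cuts the first dimension into stripes of width `eps`, so a point can only match points in its own stripe or an adjacent one.

```python
import math
from collections import defaultdict
from itertools import combinations


def brute_force_join(points, eps):
    """Return all index pairs (i, j), i < j, whose points lie within eps."""
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= eps
    }


def stripe_join(points, eps):
    """Same result, but compares only points in the same or adjacent stripes.

    Assumption for illustration: stripes of width eps along dimension 0.
    Two points within eps of each other differ by at most eps in every
    coordinate, so their stripe numbers differ by at most one.
    """
    stripes = defaultdict(list)
    for idx, p in enumerate(points):
        stripes[math.floor(p[0] / eps)].append(idx)

    result = set()
    for s, members in stripes.items():
        # Candidate pairs inside stripe s.
        for i, j in combinations(members, 2):
            if math.dist(points[i], points[j]) <= eps:
                result.add((min(i, j), max(i, j)))
        # Candidate pairs between stripe s and its right-hand neighbor s + 1.
        for i in members:
            for j in stripes.get(s + 1, ()):
                if math.dist(points[i], points[j]) <= eps:
                    result.add((min(i, j), max(i, j)))
    return result
```

For example, calling either function on the points `(0.0, 0.0)`, `(0.3, 0.4)`, `(2.0, 2.0)` with `eps = 0.6` returns the single pair `(0, 1)`. The sketch partitions on one dimension only; how the $\epsilon$ tree organizes its internal and leaf nodes is described in the paper and is not reproduced here.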