High-Dimensional Similarity Joins

Authors:
Kyuseok Shim; Ramakrishnan Srikant; Rakesh Agrawal
Venue:
ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Year:
1997

Citing 0
Cited 36

Scalable algorithms for mining large databases

KDD '99 Tutorial notes of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Density-based indexing for approximate nearest-neighbor queries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Spatial join selectivity using power laws

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Time series similarity measures (tutorial PM-2)

Tutorial notes of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
High performance clustering based on the similarity join

Proceedings of the ninth international conference on Information and knowledge management
Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Prefix-querying: an approach for effective subsequence matching under time warping in sequence databases

Proceedings of the tenth international conference on Information and knowledge management
Shape-based retrieval of similar subsequences in time-series databases

Proceedings of the 2002 ACM symposium on Applied computing
High Dimensional Similarity Joins: Algorithms and Performance Evaluation

IEEE Transactions on Knowledge and Data Engineering
A Survey of Temporal Knowledge Discovery Paradigms and Methods

IEEE Transactions on Knowledge and Data Engineering
Parallel Algorithms for High-dimensional Similarity Joins for Data Mining Applications

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Approximate Algorithms for Distance-Based Queries in High-Dimensional Data Spaces Using R-Trees

ADBIS '02 Proceedings of the 6th East European Conference on Advances in Databases and Information Systems
Optimal Dimension Order: A Generic Technique for the Similarity Join

DaWaK 2000 Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery
Partition-Based Similarity Join in High Dimensional Data Spaces

DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
On producing join results early

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
An Efficient Parallel Algorithm for High Dimensional Similarity Join

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Efficient similarity-based operations for data integration

Data & Knowledge Engineering
Efficient processing of similarity search under time warping in sequence databases: an index-based approach

Information Systems - Databases: Creation, management and utilization
Integrating XML data sources using approximate joins

ACM Transactions on Database Systems (TODS)
Shape-based retrieval in time-series databases

Journal of Systems and Software
Fast similarity join for multi-dimensional data

Information Systems
Efficient index-based KNN join processing for high-dimensional data

Information and Software Technology
An empirical study on selective partitioning dimensions for partition-based similarity joins

Data & Knowledge Engineering
Progressive merge join: a generic and non-blocking sort-based join algorithm

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Metric space similarity joins

ACM Transactions on Database Systems (TODS)
Using similarity-based operations for resolving data-level conflicts

BNCOD'03 Proceedings of the 20th British national conference on Databases
Optimization of joins using random record generation method

Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India
Similarity joins as stronger metric operations

SIGSPATIAL Special
Probabilistic similarity join on uncertain data

DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
VA-files vs. R*-trees in distance join queries

ADBIS'05 Proceedings of the 9th East European conference on Advances in Databases and Information Systems
Partition-Based similarity joins using diagonal dimensions in high dimensional data spaces

IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
Progressive high-dimensional similarity join

DEXA'07 Proceedings of the 18th international conference on Database and Expert Systems Applications
Super-EGO: fast multi-dimensional similarity join

The VLDB Journal — The International Journal on Very Large Data Bases
OCOG: A common grasp computation algorithm for a set of planar objects

Robotics and Computer-Integrated Manufacturing
DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce

Information Systems

Abstract

Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that uses a new index structure, called the epsilon-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the proposed index structure scales to high-dimensional data. Empirical evaluation, using synthetic and real-life datasets, shows that similarity join using the epsilon-kdB tree is between 2 times and an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions.
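
As a rough illustration of the kind of join the abstract describes, the Python sketch below finds all pairs of points within Euclidean distance epsilon of each other. It is not the epsilon-kdB tree itself: it uses a much simpler flat grid of epsilon-wide cells over the first few dimensions, relying only on the observation that two points within epsilon of each other must fall in the same or adjacent cells along every dimension. The function name `similarity_self_join` and the parameter `grid_dims` are hypothetical, chosen for this example only.

```python
from collections import defaultdict
from itertools import product
import math


def similarity_self_join(points, eps, grid_dims=2):
    """Return sorted index pairs (i, j), i < j, with Euclidean distance <= eps.

    Simplified grid-based sketch (an assumption for illustration, not the
    epsilon-kdB tree): points are bucketed into eps-wide cells along the
    first `grid_dims` dimensions, and only same or adjacent cells are joined.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not points:
        return []

    # Never grid on more dimensions than the points actually have.
    k = min(grid_dims, len(points[0]))

    # Map each cell key to the indices of the points that fall inside it.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(p[d] / eps) for d in range(k))
        cells[key].append(idx)

    # Points within eps differ by at most one cell along every gridded dimension.
    offsets = list(product((-1, 0, 1), repeat=k))

    pairs = []
    for key, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(key, off))
            if neighbor not in cells:
                continue
            for i in members:
                for j in cells[neighbor]:
                    # i < j reports each qualifying pair exactly once.
                    if i < j and math.dist(points[i], points[j]) <= eps:
                        pairs.append((i, j))
    return sorted(pairs)
```

The number of neighboring cells examined here grows as 3 to the power of `grid_dims`, which is why this sketch grids on only a few dimensions and falls back to the full distance test for the rest. The abstract states that the epsilon-kdB tree reduces the number of neighboring leaf nodes considered for the join test; this flat grid does not attempt that.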