Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data

Authors:
Christian Böhm;Bernhard Braunmüller;Florian Krebs;Hans-Peter Kriegel
Affiliations:
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 München, Germany (all authors)
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001



Abstract

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
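
The sort order and the join predicate described in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: the external sorting and the scheduling strategy are omitted, Euclidean distance is assumed as the metric, and all function names (`grid_cell`, `ego_sort`, `similarity_join_naive`, `similarity_join_grid`) are hypothetical. The grid-based variant relies only on the fact that two points within distance ε of each other must lie in the same or in directly adjacent grid cells.

```python
import math
from itertools import product

def grid_cell(point, eps):
    """Map a point to the integer coordinates of its grid cell (cell length eps)."""
    return tuple(math.floor(x / eps) for x in point)

def ego_sort(points, eps):
    """Sort points by the lexicographic order of their grid cells.
    Tuples compare lexicographically in Python, so the cell is the sort key."""
    return sorted(points, key=lambda p: grid_cell(p, eps))

def similarity_join_naive(r, s, eps):
    """Reference result: all pairs (p, q) whose Euclidean distance is <= eps."""
    return [(p, q) for p in r for q in s if math.dist(p, q) <= eps]

def similarity_join_grid(r, s, eps):
    """In-memory join that compares only points in identical or adjacent cells.
    Assumption: everything fits in main memory, which is exactly the
    limitation the paper's external algorithm is designed to avoid."""
    buckets = {}
    for q in s:
        buckets.setdefault(grid_cell(q, eps), []).append(q)
    result = []
    for p in ego_sort(r, eps):
        cell = grid_cell(p, eps)
        # Enumerate the 3^d cells around p's cell (only practical for small d).
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for q in buckets.get(neighbor, []):
                if math.dist(p, q) <= eps:
                    result.append((p, q))
    return result
```

Enumerating all 3^d neighboring cells grows exponentially with the dimension d, so this sketch is suitable only for low-dimensional illustration. It is meant to show why a grid with cell length ε prunes candidate pairs, not how the paper handles high-dimensional or disk-resident data.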