GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces

Authors:
Jens-Peter Dittrich;Bernhard Seeger
Affiliations:
University of Marburg;University of Marburg
Venue:
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2001

Citing 21
Cited 15

Spatial query processing in an object-oriented database system

SIGMOD '86 Proceedings of the 1986 ACM SIGMOD international conference on Management of data
The design and analysis of spatial data structures

The design and analysis of spatial data structures
The query by image content (QBIC) system

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Spatial hash-joins

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Partition based spatial-merge join

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Size separation spatial join

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
A cost model for nearest neighbor search in high-dimensional data space

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Self-spacial join selectivity estimation using fractal concepts

ACM Transactions on Information Systems (TOIS)
The art of computer programming, volume 3: (2nd ed.) sorting and searching

The art of computer programming, volume 3: (2nd ed.) sorting and searching
Spatial join selectivity using power laws

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
javax.XXL: a prototype for a library of query processing algorithms

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data integration using similarity joins and a word-based information representation language

ACM Transactions on Information Systems (TOIS)
High performance clustering based on the similarity join

Proceedings of the ninth international conference on Information and knowledge management
High Dimensional Similarity Joins: Algorithms and Performance Evaluation

IEEE Transactions on Knowledge and Data Engineering
High-Dimensional Similarity Joins

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
High Dimensional Similarity Joins: Algorithms and Performance Evaluation

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
A Cost Model and Index Architecture for the Similarity Join

Proceedings of the 17th International Conference on Data Engineering
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries

Proceedings of the 27th International Conference on Very Large Data Bases
An Algorithm for Computing the Overlay of k-Dimensional Spaces

SSD '91 Proceedings of the Second International Symposium on Advances in Spatial Databases
Data Redundancy and Duplicate Detection in Spatial Join Processing

ICDE '00 Proceedings of the 16th International Conference on Data Engineering

XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries

Proceedings of the 27th International Conference on Very Large Data Bases
On producing join results early

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Hypercube sweeping algorithm for subsequence motion matching in large motion databases

Proceedings of the 2006 ACM international conference on Virtual reality continuum and its applications
Progressive merge join: a generic and non-blocking sort-based join algorithm

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Gorder: an efficient method for KNN join processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Metric space similarity joins

ACM Transactions on Database Systems (TODS)
A sorting approach to indexing spatial data

ACM SIGGRAPH 2008 classes
Indexing Moving Objects Using Short-Lived Throwaway Indexes

SSTD '09 Proceedings of the 11th International Symposium on Advances in Spatial and Temporal Databases
Similarity joins as stronger metric operations

SIGSPATIAL Special
Predicate-based indexing for desktop search

The VLDB Journal — The International Journal on Very Large Data Bases
Sorting in space: multidimensional, spatial, and metric data structures for computer graphics applications

ACM SIGGRAPH ASIA 2010 Courses
MOVIES: indexing moving objects by shooting index images

Geoinformatica
VA-files vs. r*-trees in distance join queries

ADBIS'05 Proceedings of the 9th East European conference on Advances in Databases and Information Systems
Indexing methods for moving object databases: games and other applications

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Super-EGO: fast multi-dimensional similarity join

The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance ε of each other. One of the most efficient algorithms for processing similarity joins is the Multidimensional-Spatial Join (MSJ) by Koudas and Sevcik. In our previous work, which addressed the two-dimensional case, we found however that MSJ has several performance shortcomings in terms of CPU cost, I/O cost, and memory requirements. Therefore, MSJ is not generally applicable to high-dimensional data.

In this paper, we propose a new algorithm named Generic External Space Sweep (GESS). GESS introduces a modest rate of data replication to reduce the number of expensive distance computations. We present a new cost model for replication, an I/O model, and an inexpensive method for duplicate removal. The principal component of our algorithm is a highly flexible replication engine.

Our analytical model predicts that GESS reduces the number of expensive distance computations by several orders of magnitude compared to MSJ (a factor of 10^7). In addition, the memory requirements of GESS are shown to be lower by several orders of magnitude. Furthermore, the I/O cost of our algorithm is better by a factor of 2, regardless of whether replication occurs. Our analytical results are confirmed by a large series of simulations and experiments with synthetic and real high-dimensional data sets.
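
To make the definition of the similarity join concrete, the following is a minimal brute-force sketch. It is neither GESS nor MSJ: it compares every pair of points, which is exactly the cost that those algorithms aim to avoid. The function name `naive_similarity_join`, the use of Euclidean distance, and the inclusive threshold (distance <= ε) are assumptions made for illustration; the abstract does not specify a metric.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def naive_similarity_join(r, s, eps):
    """Return every pair (x, y), x from r and y from s, with distance <= eps.

    Illustrative baseline only: it performs len(r) * len(s) distance
    computations. Euclidean distance is an assumption, not taken from
    the abstract.
    """
    return [(x, y) for x, y in product(r, s) if dist(x, y) <= eps]


# Two tiny three-dimensional data sets (hypothetical values).
r = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
s = [(0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]

result = naive_similarity_join(r, s, eps=0.5)
print(result)
```

Because this baseline evaluates the distance for all pairs, its cost grows with the product of the two data set sizes, which is why reducing the number of distance computations is the central concern of the abstract.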