Super-EGO: fast multi-dimensional similarity join

Authors: Dmitri V. Kalashnikov
Affiliations: Department of Computer Science, University of California, Irvine, USA
Venue: The VLDB Journal (The International Journal on Very Large Data Bases)
Year: 2013

Abstract

Efficient processing of high-dimensional similarity joins plays an important role in a wide variety of data-driven applications. In this paper, we consider the $\varepsilon$-join variant of the problem: given two $d$-dimensional datasets and a parameter $\varepsilon$, the task is to find all pairs of points, one from each dataset, that are within distance $\varepsilon$ of each other. We propose a new $\varepsilon$-join algorithm, called Super-EGO, which belongs to the EGO family of join algorithms. The new algorithm gains its advantage from three sources: a novel data-driven dimensionality reordering technique, a new EGO strategy that more aggressively avoids unnecessary computation, and a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates a significant advantage of the proposed solution over existing state-of-the-art techniques.
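
To make the problem statement concrete, the $\varepsilon$-join defined above can be sketched as a brute-force nested loop. This is only an illustration of the join's semantics, not the Super-EGO algorithm, whose details the abstract does not give. The function name `epsilon_join_naive`, the choice of Euclidean distance, and the inclusive comparison (`<=`) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def epsilon_join_naive(
    a: List[Point], b: List[Point], eps: float
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) such that dist(a[i], b[j]) <= eps.

    Brute-force baseline costing O(|a| * |b| * d) distance work.
    Assumptions (not stated in the abstract): Euclidean (L2) distance
    and an inclusive threshold.
    """
    eps_sq = eps * eps
    result: List[Tuple[int, int]] = []
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            # Compare squared distances to avoid a square root per pair.
            dist_sq = sum((x - y) ** 2 for x, y in zip(p, q))
            if dist_sq <= eps_sq:
                result.append((i, j))
    return result


if __name__ == "__main__":
    a = [(0.0, 0.0), (1.0, 1.0)]
    b = [(0.0, 0.5), (3.0, 3.0)]
    print(epsilon_join_naive(a, b, 0.6))  # prints [(0, 0)]
```

Every pair of points is compared here, which is exactly the cost that join algorithms such as those in the EGO family aim to avoid by pruning pairs that cannot fall within $\varepsilon$ of each other.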