Super-EGO: fast multi-dimensional similarity join

  • Authors: Dmitri V. Kalashnikov

  • Affiliations: Department of Computer Science, University of California, Irvine, USA

  • Venue: The VLDB Journal — The International Journal on Very Large Data Bases
  • Year: 2013

Abstract

Efficient processing of high-dimensional similarity joins plays an important role in a wide variety of data-driven applications. In this paper, we consider the $$\varepsilon$$-join variant of the problem. Given two $$d$$-dimensional datasets and a parameter $$\varepsilon$$, the task is to find all pairs of points, one from each dataset, that are within distance $$\varepsilon$$ of each other. We propose a new $$\varepsilon$$-join algorithm, called Super-EGO, which belongs to the EGO family of join algorithms. The new algorithm gains its advantage by using a novel data-driven dimensionality re-ordering technique, by developing a new EGO-strategy that more aggressively avoids unnecessary computation, and by developing a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates a significant advantage of the proposed solution over the existing state-of-the-art techniques.
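
To make the problem statement concrete, the following minimal Python sketch computes an ε-join in two ways: a brute-force reference that tests every pair, and a simple grid filter that only compares points lying in the same or adjacent cells of side ε. This is not the Super-EGO algorithm, whose details the abstract does not give. The function names, the choice of Euclidean distance, the inclusive boundary (distance ≤ ε), and the hash-grid filter itself are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def brute_force_join(a, b, eps):
    """Reference eps-join: tests every pair, O(|a| * |b|) distance computations.

    Assumes Euclidean distance and an inclusive threshold (dist <= eps).
    Returns a set of index pairs (i, j) with a[i] and b[j] within eps.
    """
    return {
        (i, j)
        for i, p in enumerate(a)
        for j, q in enumerate(b)
        if math.dist(p, q) <= eps
    }


def grid_join(a, b, eps):
    """Same result as brute_force_join, using a grid of cell side eps as a filter.

    Two points within eps of each other differ by at most eps in every
    coordinate, so their cell indices differ by at most 1 per dimension.
    Only the 3**d neighbouring cells therefore need to be inspected.
    """
    def cell(p):
        return tuple(math.floor(x / eps) for x in p)

    # Bucket the points of b by grid cell.
    cells = defaultdict(list)
    for j, q in enumerate(b):
        cells[cell(q)].append(j)

    result = set()
    for i, p in enumerate(a):
        c = cell(p)
        # Visit the cell of p and every adjacent cell.
        for offset in product((-1, 0, 1), repeat=len(c)):
            neighbour = tuple(ci + oi for ci, oi in zip(c, offset))
            for j in cells.get(neighbour, ()):
                if math.dist(p, b[j]) <= eps:
                    result.add((i, j))
    return result
```

For example, calling `grid_join(a, b, 0.5)` on two lists of coordinate tuples returns the same set of index pairs as `brute_force_join(a, b, 0.5)`. Because the number of neighbouring cells grows as 3^d, this filter is practical only in low dimensions and serves here purely to illustrate the task definition.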