Compact Similarity Joins

Authors:
Brent Bryan; Frederick Eberhardt; Christos Faloutsos
Affiliations:
Brent Bryan: Machine Learning Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. bryanba@cs.cmu.edu
Frederick Eberhardt: Department of Philosophy, University of California, Berkeley, 314 Moses Hall #2390, Berkeley, CA 94720, USA. fde@berkeley.edu
Christos Faloutsos: Machine Learning Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. christos@cs.cmu.edu
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited by 3:

Generalizing prefix filtering to improve set similarity joins
Information Systems

Large-scale similarity-based join processing in multimedia databases
MMM'12 Proceedings of the 18th international conference on Advances in Multimedia Modeling

Super-EGO: fast multi-dimensional similarity join
The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

Similarity joins have attracted significant interest, with applications in Geographical Information Systems, astronomy, marketing analyses, and anomaly detection. However, all past algorithms, although highly fine-tuned, suffer an output explosion if the query range is even moderately large relative to the local data density. Under such circumstances, the response time and the search effort are both almost quadratic in the database size, which is often prohibitive. We solve this problem by providing two algorithms that find a compact representation of the similarity join result, while retaining all the information in the standard join. Our algorithms have the following characteristics: (a) they are at least as fast as the standard similarity join algorithm, and typically much faster, (b) they generate significantly smaller output, (c) they provably lose no information, (d) they scale well to large data sets, and (e) they can be applied to any of the standard tree data structures. Experiments on real and realistic point-sets show that our algorithms are up to several orders of magnitude faster.
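
The output explosion described in the abstract can be made concrete with a small Python sketch. This is an illustration only and is not either of the paper's two algorithms: it assumes 2-D points and Euclidean distance, and the names `naive_join`, `compact_join`, and `expand` are hypothetical. The compact form here uses a simple grid whose cell diagonal equals the query range, so all points sharing a cell are mutually within range and can be reported as one group instead of as every pair. It shows only the reduction in output size, not any gain in running time.

```python
import math
from collections import defaultdict
from itertools import combinations


def naive_join(points, r):
    """Standard similarity self-join: all index pairs (i, j), i < j, within distance r."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= r
    }


def compact_join(points, r):
    """Illustrative compact output: groups of mutually close points plus leftover links.

    Assumption: a grid with cell side r / sqrt(2), so the cell diagonal is r and
    any two points in the same cell are within range of each other.
    """
    side = r / math.sqrt(2)
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(math.floor(x / side), math.floor(y / side))].append(i)

    # One entry per dense cell replaces all pairs inside that cell.
    groups = [ids for ids in cells.values() if len(ids) > 1]

    # Pairs that straddle two cells are still listed individually in this sketch.
    cell_of = {i: key for key, ids in cells.items() for i in ids}
    links = [(i, j) for i, j in naive_join(points, r) if cell_of[i] != cell_of[j]]
    return groups, links


def expand(groups, links):
    """Recover the full pair set from the compact form (shows nothing was lost)."""
    pairs = set(links)
    for ids in groups:
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# A dense cluster of 50 points, one nearby point in a neighbouring cell,
# and one distant point that joins with nothing.
points = [(0.001 * k, 0.001 * k) for k in range(50)] + [(0.8, 0.0), (10.0, 10.0)]
r = 1.0

full = naive_join(points, r)
groups, links = compact_join(points, r)

print("pairs in standard join:", len(full))
print("compact output items:  ", len(groups) + len(links))
print("lossless:", expand(groups, links) == full)
```

Within the dense cluster the standard join emits a number of pairs quadratic in the cluster size, while the grouped form emits a single item listing the cluster's members. Expanding the groups reproduces the standard join exactly, which is the sense in which such a representation loses no information.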