Efficient Set Similarity Joins Using Min-prefixes

Authors:
Leonardo A. Ribeiro;Theo Härder
Affiliations:
AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany;AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany
Venue:
ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Year:
2009

Citing 13
Cited 2

Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Evaluating similarity measures: a large-scale study in the orkut social network

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
An efficient filter for approximate membership checking

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Top-k Set Similarity Joins

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering

Generalizing prefix filtering to improve set similarity joins

Information Systems
Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Identification of all objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most set similarity join methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to previous algorithms.