Similarity join size estimation using locality sensitive hashing

Authors:
Hongrae Lee;Raymond T. Ng;Kyuseok Shim
Affiliations:
University of British Columbia;University of British Columbia;Seoul National University
Venue:
Proceedings of the VLDB Endowment
Year:
2011

Abstract

Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). VSJ generalizes the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling-based algorithm that uses locality-sensitive hashing (LSH). The proposed algorithm, LSH-SS, uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and with the state-of-the-art technique for SSJ (adapted to VSJ), and demonstrate on real-world data sets that LSH-SS offers more accurate estimates with small variance throughout the similarity threshold range.
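
To make the idea of sampling with the help of an LSH index concrete, the following is a minimal illustrative sketch and not the authors' LSH-SS algorithm. It assumes cosine similarity, a self-join over a small in-memory list of vectors, and random-hyperplane (sign) hashing as the LSH family. The function names (`cosine`, `lsh_buckets`, `stratified_estimate`) and parameters (`num_bits`, `samples_per_stratum`) are hypothetical. The sketch splits all pairs into two strata, pairs that share an LSH bucket and pairs that do not, samples uniformly within each stratum, and scales the hit rate by the stratum size. Pairs sharing a bucket are more likely to be similar, which is why sampling from that stratum can remain informative at high thresholds where plain random sampling rarely hits a qualifying pair.

```python
import itertools
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def exact_join_size(vectors, tau):
    """Brute-force self-join size: number of unordered pairs with cosine >= tau."""
    return sum(1 for u, v in itertools.combinations(vectors, 2) if cosine(u, v) >= tau)


def lsh_buckets(vectors, num_bits, rng):
    """Group vector indices by the sign pattern of num_bits random hyperplanes."""
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    buckets = defaultdict(list)
    for idx, v in enumerate(vectors):
        key = tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return buckets


def stratified_estimate(vectors, tau, num_bits=8, samples_per_stratum=2000, seed=0):
    """Estimate the self-join size by sampling same-bucket and cross-bucket pairs."""
    rng = random.Random(seed)
    n = len(vectors)
    buckets = lsh_buckets(vectors, num_bits, rng)
    bucket_of = {i: key for key, members in buckets.items() for i in members}

    # Stratum H: pairs that share a bucket. Stratum L: all remaining pairs.
    groups = [m for m in buckets.values() if len(m) >= 2]
    weights = [len(m) * (len(m) - 1) // 2 for m in groups]
    size_h = sum(weights)
    size_l = n * (n - 1) // 2 - size_h

    estimate = 0.0
    if size_h:
        hits = 0
        for _ in range(samples_per_stratum):
            # Pick a bucket proportionally to its pair count, then a uniform pair in it.
            members = rng.choices(groups, weights=weights)[0]
            i, j = rng.sample(members, 2)
            hits += cosine(vectors[i], vectors[j]) >= tau
        estimate += size_h * hits / samples_per_stratum
    if size_l:
        hits = 0
        drawn = 0
        while drawn < samples_per_stratum:
            # Rejection sampling: keep only pairs that fall in different buckets.
            i, j = rng.sample(range(n), 2)
            if bucket_of[i] == bucket_of[j]:
                continue
            drawn += 1
            hits += cosine(vectors[i], vectors[j]) >= tau
        estimate += size_l * hits / samples_per_stratum
    return estimate
```

On a small data set, the result of `stratified_estimate(vectors, tau)` can be checked against `exact_join_size(vectors, tau)`. A fixed per-stratum sample size is a simplification chosen for brevity; the abstract does not specify how LSH-SS allocates or adapts its samples.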