Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It generalizes the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. A key challenge in similarity join size estimation is that the join size can change dramatically with the input similarity threshold. We propose a sampling-based algorithm that uses Locality-Sensitive Hashing (LSH). The proposed algorithm, LSH-SS, uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and with the state-of-the-art technique for SSJ (adapted to VSJ), and demonstrate on real-world data sets that LSH-SS offers more accurate estimates throughout the similarity threshold range, with small variance.
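To make the setting concrete, the following is a minimal sketch of two estimators for the number of vector pairs whose cosine similarity meets a threshold: plain random pair sampling, and a stratified variant that uses one random-hyperplane LSH table to separate same-bucket pairs from the rest. It is an illustration under stated assumptions and not the paper's LSH-SS algorithm, whose details are not given in the abstract. The function names, the choice of cosine similarity, the hyperplane hash family, and the equal sample budget per stratum are all assumptions made here.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def exact_join_size(vectors, tau):
    """Ground truth: number of unordered pairs with similarity >= tau."""
    return sum(1 for u, v in combinations(vectors, 2) if cosine(u, v) >= tau)


def random_sampling_estimate(vectors, tau, num_samples, rng):
    """Baseline: sample pairs uniformly and scale the hit rate to all pairs."""
    n = len(vectors)
    total_pairs = n * (n - 1) // 2
    hits = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)
        if cosine(vectors[i], vectors[j]) >= tau:
            hits += 1
    return total_pairs * hits / num_samples


def lsh_stratified_estimate(vectors, tau, k, samples_per_stratum, rng):
    """Hypothetical stratified estimator (an assumption, not LSH-SS itself).

    Stratum H holds pairs that share a bucket in one random-hyperplane LSH
    table with k hash bits; stratum L holds all remaining pairs. Each stratum
    is sampled separately and the scaled hit counts are added.
    """
    n, dim = len(vectors), len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    sig = [tuple(sum(a * h for a, h in zip(v, p)) >= 0 for p in planes)
           for v in vectors]
    buckets = defaultdict(list)
    for idx, s in enumerate(sig):
        buckets[s].append(idx)

    total_pairs = n * (n - 1) // 2
    big = [b for b in buckets.values() if len(b) >= 2]
    weights = [len(b) * (len(b) - 1) // 2 for b in big]
    pairs_h = sum(weights)
    pairs_l = total_pairs - pairs_h

    estimate = 0.0
    if pairs_h:
        hits = 0
        for _ in range(samples_per_stratum):
            # Pick a bucket in proportion to its pair count, then a pair in it.
            bucket = rng.choices(big, weights=weights)[0]
            i, j = rng.sample(bucket, 2)
            if cosine(vectors[i], vectors[j]) >= tau:
                hits += 1
        estimate += pairs_h * hits / samples_per_stratum
    if pairs_l:
        hits, drawn = 0, 0
        while drawn < samples_per_stratum:
            i, j = rng.sample(range(n), 2)
            if sig[i] == sig[j]:
                continue  # rejection: this pair belongs to stratum H
            drawn += 1
            if cosine(vectors[i], vectors[j]) >= tau:
                hits += 1
        estimate += pairs_l * hits / samples_per_stratum
    return estimate
```

At a high threshold very few pairs qualify, so uniform sampling often sees no hits at all. Pairs that collide in an LSH bucket are more likely to be similar, which is why sampling that stratum separately can help; how well this particular sketch performs has not been measured here.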