A similarity join between two sets of records returns the pairs of records whose similarity is no less than a given threshold. More specifically, given two sets of records, R and S, a similarity function sim(·,·), and a threshold t, the similarity join between R and S is defined as { (r, s) | (r, s) ∈ R × S, sim(r, s) ≥ t }. The similarity join generalizes the traditional equality join commonly found in database systems. A variant replaces the similarity threshold with a distance threshold, so that a pair qualifies when its distance is no greater than the threshold. The similarity threshold is generally expected to be close to the maximum possible value (usually 1.0), and the distance threshold close to the minimum possible value (usually 0). For example, we may find near-duplicate documents in a document repository using a cosine similarity threshold of 0.9, or we may find pairs of misspelled queries and their correct versions in a query log using an edit distance threshold of 2.
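The definition above can be read directly as a nested-loop computation. The following is a minimal sketch of that reading, not an efficient join algorithm: it compares every pair in R × S. The function names (`similarity_join`, `distance_join`, `jaccard`, `edit_distance`) are illustrative assumptions, and Jaccard similarity over whitespace tokens is used here as a simple stand-in for the cosine similarity mentioned in the text.

```python
from itertools import product


def similarity_join(R, S, sim, t):
    """Naive join: all (r, s) in R x S with sim(r, s) >= t."""
    return [(r, s) for r, s in product(R, S) if sim(r, s) >= t]


def distance_join(R, S, dist, d):
    """Distance variant: all (r, s) in R x S with dist(r, s) <= d."""
    return [(r, s) for r, s in product(R, S) if dist(r, s) <= d]


def jaccard(a, b):
    """Jaccard similarity of the whitespace-token sets of two strings
    (an assumed stand-in similarity function, not the one from the text)."""
    x, y = set(a.split()), set(b.split())
    if not x and not y:
        return 1.0  # two empty records are treated as identical
    return len(x & y) / len(x | y)


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Near-duplicate documents under a similarity threshold.
docs_r = ["a b c d", "x y z"]
docs_s = ["a b c d e", "x q"]
near_duplicates = similarity_join(docs_r, docs_s, jaccard, 0.8)

# Misspelled queries and their corrections under an edit distance threshold of 2.
queries = ["similarty", "jion"]
dictionary = ["similarity", "join"]
corrections = distance_join(queries, dictionary, edit_distance, 2)
```

Because this sketch evaluates the similarity function on all |R| · |S| pairs, it is useful only as a reference for what a similarity join returns; practical algorithms aim to avoid most of these comparisons.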