Similarity joins as stronger metric operations

Authors:
Wei Wang
Affiliations:
University of New South Wales, Australia
Venue:
SIGSPATIAL Special
Year:
2010

Abstract

A similarity join between two sets of records returns the pairs of records whose similarity is no less than a given threshold. More specifically, given two sets of records R and S, a similarity function sim(·, ·), and a threshold t, the similarity join between R and S is defined as { (r, s) | (r, s) ∈ R × S, sim(r, s) ≥ t }. A similarity join generalizes the traditional equality join commonly found in database systems. A variant of the similarity join uses a distance threshold in place of the similarity threshold, returning the pairs whose distance is no greater than that threshold. The similarity threshold is generally expected to be close to the maximum possible value (usually 1.0), and the distance threshold close to the minimum possible value (usually 0). For example, we may find near-duplicate documents in a document repository using a cosine similarity threshold of 0.9, or we may find pairs of incorrectly spelt queries and their correct versions in a query log with an edit distance threshold of 2.
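The definition can be made concrete with a minimal sketch in Python. This is a naive nested-loop evaluation that compares every pair in R × S, so it costs O(|R|·|S|) comparisons; it only illustrates the two join variants and is not one of the optimized algorithms studied in the literature. All function names (`similarity_join`, `distance_join`, `cosine_sim`, `edit_distance`) are hypothetical, and treating a record as a whitespace-tokenized string with raw term counts is an assumption made for the example.

```python
from collections import Counter
from math import sqrt


def cosine_sim(a, b):
    """Cosine similarity of two strings over raw whitespace-token counts (assumed model)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[tok] * cb[tok] for tok in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similarity_join(R, S, sim, t):
    """Return { (r, s) in R x S : sim(r, s) >= t } by exhaustive comparison."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]


def distance_join(R, S, dist, d):
    """Distance-threshold variant: return { (r, s) in R x S : dist(r, s) <= d }."""
    return [(r, s) for r in R for s in S if dist(r, s) <= d]


if __name__ == "__main__":
    # Hypothetical query log: misspelt queries joined with a list of correct ones.
    queries = ["similarty", "jion"]
    vocabulary = ["similarity", "join", "metric"]
    print(distance_join(queries, vocabulary, edit_distance, 2))
```

With the similarity threshold set to its maximum (or the distance threshold set to 0), either function reduces to an ordinary equality join under these measures, which is the sense in which the similarity join generalizes it.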