Efficient exact set-similarity joins

Authors:
Arvind Arasu;Venkatesh Ganti;Raghav Kaushik
Affiliations:
Microsoft Research, One Microsoft Way, Redmond, WA;Microsoft Research, One Microsoft Way, Redmond, WA;Microsoft Research, One Microsoft Way, Redmond, WA
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
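
To make the SSJoin definition concrete, here is a minimal brute-force sketch. It is not the algorithm proposed in the paper; it is only the quadratic baseline that an exact method must agree with. The choice of Jaccard similarity, the threshold value, and the names `jaccard` and `naive_ssjoin` are assumptions made for illustration, since the abstract does not fix a particular similarity function.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_ssjoin(left, right, threshold):
    """Return every index pair (i, j) with jaccard(left[i], right[j]) >= threshold.

    This compares all |left| * |right| pairs, so it is exact but does not scale.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if jaccard(a, b) >= threshold
    ]


# Hypothetical toy input: each set is the set of characters in a short string.
left = [frozenset("abcd"), frozenset("xyz")]
right = [frozenset("abce"), frozenset("xyz"), frozenset("pq")]

pairs = naive_ssjoin(left, right, 0.5)
print(pairs)
```

On this toy input the join keeps the pair of sets sharing three of five distinct elements and the pair of identical sets, and drops every disjoint pair. Any exact SSJoin algorithm should return the same pairs as this baseline for the same similarity function and threshold, only faster.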