A Primitive Operator for Similarity Joins in Data Cleaning

Authors:
Surajit Chaudhuri;Venkatesh Ganti;Raghav Kaushik
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Abstract

Data cleaning based on similarities involves identifying "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator that can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, as well as notions of similarity that go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.
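
The abstract does not describe the operator's interface, so the following is only an illustrative sketch of the general idea of layering several similarity functions on one shared primitive. It assumes, as a simplification, that each record is mapped to a set and that the shared primitive is a join on set overlap. All names here (`qgram_set`, `word_set`, `overlap_join`, `jaccard_join`) are hypothetical, and the quadratic loop makes no attempt at the efficient implementations the paper refers to.

```python
from itertools import combinations


def qgram_set(text, q=3):
    """Map a string to its set of q-grams (hypothetical helper, padded with '#')."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def word_set(text):
    """Map a string to its set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())


def overlap_join(sets, enough_overlap):
    """Assumed shared primitive: return index pairs whose set overlap passes a predicate.

    enough_overlap(overlap, size_a, size_b) decides whether a pair qualifies.
    This is a naive all-pairs loop, written for clarity rather than speed.
    """
    pairs = []
    for i, j in combinations(range(len(sets)), 2):
        overlap = len(sets[i] & sets[j])
        if enough_overlap(overlap, len(sets[i]), len(sets[j])):
            pairs.append((i, j))
    return pairs


def jaccard_join(records, to_set, threshold):
    """Jaccard similarity join expressed through the overlap primitive.

    Uses the identity: Jaccard(a, b) >= t  <=>  |a & b| * (1 + t) >= t * (|a| + |b|).
    """
    sets = [to_set(r) for r in records]
    return overlap_join(
        sets,
        lambda o, na, nb: o * (1 + threshold) >= threshold * (na + nb),
    )


if __name__ == "__main__":
    records = ["Microsoft Corp", "Microsoft Corp.", "Oracle Inc"]
    # Same primitive, two different notions of similarity.
    print(jaccard_join(records, qgram_set, 0.7))
    print(jaccard_join(records, word_set, 0.3))
```

Swapping `qgram_set` for `word_set` changes the notion of similarity without touching `overlap_join`, which is the kind of decoupling between the join machinery and the similarity function that the abstract argues for.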