A partition-based method for string similarity joins with edit-distance constraints

Authors:
Guoliang Li;Dong Deng;Jianhua Feng
Affiliations:
Tsinghua University, China;Tsinghua University, China;Tsinghua University, China
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2013

Citing 40
Cited 0

Algorithms for approximate string matching

Information and Control
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Extending q-grams to estimate selectivity of string matching with low edit distance

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Metric space similarity joins

ACM Transactions on Database Systems (TODS)
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
An efficient filter for approximate membership checking

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SEPIA: estimating selectivities of approximate string predicates in large Databases

The VLDB Journal — The International Journal on Very Large Data Bases
Hashed samples: selectivity estimators for set similarity selection queries

Proceedings of the VLDB Endowment
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Scalable ad-hoc entity extraction from text collections

Proceedings of the VLDB Endowment
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Top-k Set Similarity Joins

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Incremental maintenance of length normalized indexes for approximate string matching

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Power-law based estimation of set similarity join size

Proceedings of the VLDB Endowment
Probabilistic string similarity joins

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Proceedings of the VLDB Endowment
Similarity join size estimation using locality sensitive hashing

Proceedings of the VLDB Endowment
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Answering approximate string queries on large data sets using external memory

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Fast-join: An efficient method for fuzzy token matching based string similarity join

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
Pass-join: a partition-based method for similarity joins

Proceedings of the VLDB Endowment
Can we beat the prefix filtering?: an adaptive framework for similarity join and search

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Trie-join: a trie-based method for efficient string similarity joins

The VLDB Journal — The International Journal on Very Large Data Bases
Supporting Search-As-You-Type Using SQL in Databases

IEEE Transactions on Knowledge and Data Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.