Can we beat the prefix filtering?: an adaptive framework for similarity join and search

Authors:
Jiannan Wang; Guoliang Li; Jianhua Feng
Affiliations:
Tsinghua University, Beijing, China; Tsinghua University, Beijing, China; Tsinghua University, Beijing, China
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Abstract

Similarity join and similarity search are two important operations in data cleaning and have attracted much attention recently. Existing similarity-join methods usually adopt a prefix-filtering-based framework: they select a prefix of each object and prune object pairs whose prefixes do not overlap. We observe that prefix length has a significant effect on performance, and that prefix filtering does not always achieve high performance. To address this problem, we propose an adaptive framework for similarity join. We propose a cost model to judiciously select an appropriate prefix for each object, and we devise effective indexes to select prefixes efficiently. We also extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.
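
The prefix-filtering idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes objects are token sets, similarity is an overlap threshold `t`, tokens are ordered by ascending frequency, and all pairs are compared with a nested loop rather than the indexes the abstract mentions. The function names and the parameter `ell` are hypothetical. With `ell=1` the sketch performs the usual prefix filter; larger values of `ell` take longer prefixes and require at least `ell` shared prefix tokens, which is one assumed way that prefix length can trade filtering cost against pruning power. The cost model for choosing a prefix per object is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def global_order(records):
    """Map each token to a sort key: rare tokens first, ties broken by token."""
    freq = Counter(tok for rec in records for tok in set(rec))
    return {tok: (count, tok) for tok, count in freq.items()}


def prefix(record, rank, t, ell=1):
    """Return the first |record| - t + ell tokens under the global order."""
    ordered = sorted(set(record), key=lambda tok: rank[tok])
    length = max(0, len(ordered) - t + ell)
    return set(ordered[:length])


def candidate_pairs(records, t, ell=1):
    """Keep only pairs whose prefixes share at least ell tokens."""
    if not 1 <= ell <= t:
        raise ValueError("this sketch assumes 1 <= ell <= t")
    rank = global_order(records)
    prefixes = [prefix(rec, rank, t, ell) for rec in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if len(prefixes[i] & prefixes[j]) >= ell
    ]


def similarity_join(records, t, ell=1):
    """Verify each surviving candidate against the overlap threshold t."""
    return [
        (i, j)
        for i, j in candidate_pairs(records, t, ell)
        if len(set(records[i]) & set(records[j])) >= t
    ]


if __name__ == "__main__":
    # Hypothetical toy data, used only to exercise the sketch.
    data = [
        ["a", "b", "c", "d"],
        ["a", "b", "c", "e"],
        ["x", "y", "z", "w"],
        ["a", "x", "q", "r"],
    ]
    print(similarity_join(data, t=3))
    print(similarity_join(data, t=3, ell=2))
```

The filter is safe under these assumptions because any two sets sharing at least `t` tokens must have their `ell` smallest common tokens inside both prefixes, so the verification step sees every true result for any valid `ell`.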