Efficient exact edit similarity query processing with the asymmetric signature scheme

Authors:
Jianbin Qin;Wei Wang;Yifei Lu;Chuan Xiao;Xuemin Lin
Affiliations:
University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales & East Normal China University, Sydney, Australia
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Citing 34
Cited 12

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
The String-to-String Correction Problem

Journal of the ACM (JACM)
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
High Dimensional Similarity Joins: Algorithms and Performance Evaluation

IEEE Transactions on Knowledge and Data Engineering
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Making the Pyramid Technique Robust to Query Types and Workloads

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Finding similar files in large document repositories

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Tandem repeats over the edit distance

Bioinformatics
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Principles of hash-based text retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Extending autocompletion to tolerate errors

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Indexing Variable Length Substrings for Exact and Approximate Matching

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Similarity search on Bregman divergence: towards non-metric indexing

Proceedings of the VLDB Endowment
Similarity join in metric spaces

ECIR'03 Proceedings of the 25th European conference on IR research
Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient and effective similarity search over probabilistic data based on earth mover's distance

Proceedings of the VLDB Endowment
Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Proceedings of the VLDB Endowment

Can we beat the prefix filtering?: an adaptive framework for similarity join and search

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Supporting efficient top-k queries in type-ahead search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Scalable and domain-independent entity coreference: establishing high quality data linkages across heterogeneous data sources

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Trie-based similarity search and join

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Cache-aware parallel approximate matching and join algorithms using BWT

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
String similarity measures and joins with synonyms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)
Efficient error-tolerant query autocompletion

Proceedings of the VLDB Endowment
Efficient processing of graph similarity queries with edit distance constraints

The VLDB Journal — The International Journal on Very Large Data Bases

Quantified Score

Hi-index	0.00

Visualization

Abstract

Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing method answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the minimum signature size lower bound is t +1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experiment results clearly demonstrate the efficiency of our methods.