Asymmetric signature schemes for efficient exact edit similarity query processing

Authors:
Jianbin Qin;Wei Wang;Chuan Xiao;Yifei Lu;Xuemin Lin;Haixun Wang
Affiliations:
University of New South Wales, Sydney, Australia;University of New South Wales and Microsoft Research Asia, Sydney, Australia;Nagoya University, Nagoya, Japan;University of New South Wales, Sydney, Australia;University of New South Wales and East China Normal University, Sydney, Australia;Microsoft Research Asia, Beijing, China
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2013

Abstract

Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold τ. Most existing methods for answering edit similarity queries generate string subsequences as signatures and produce candidates by running set overlap queries on the query and data signatures. In this article, we show that for any such signature scheme, the lower bound on the minimum number of signatures is τ + 1, which is lower than what existing methods achieve. We then propose several asymmetric signature schemes, that is, schemes that extract different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve efficiency. We then generalize the basic scheme by incorporating the novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using a global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experimental results demonstrate the efficiency of our methods and characterize the space and time behavior of our proposed algorithms.
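
The idea of matching q-chunks of one string against q-grams of the other can be illustrated with a minimal Python sketch. It is not the article's IndexGram or IndexChunk algorithm: it uses no index, ignores positions, and considers only full (unpadded) chunks. All function names are hypothetical. The filter rests on one observation: the q-chunks of a string are disjoint, so a single edit operation can destroy at most one of them, and every surviving chunk must occur among the q-grams of the other string. A string within edit distance τ of the query therefore matches at least (number of query chunks − τ) chunks. Candidates that pass the filter are then verified with an ordinary dynamic-programming edit distance.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def q_grams(s: str, q: int) -> set[str]:
    """All overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def q_chunks(s: str, q: int) -> list[str]:
    """Disjoint substrings of length q at offsets 0, q, 2q, ... (full chunks only)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def passes_filter(data: str, query: str, q: int, tau: int) -> bool:
    """Asymmetric count filter: chunks from the query, grams from the data string.

    Assumption of this sketch: positions are ignored, which weakens the
    filter but never rejects a true answer.
    """
    chunks = q_chunks(query, q)
    grams = q_grams(data, q)
    matched = sum(1 for c in chunks if c in grams)
    return matched >= len(chunks) - tau


def search(database: list[str], query: str, q: int, tau: int) -> list[str]:
    """Filter candidates, then verify each one with the exact edit distance."""
    return [s for s in database
            if passes_filter(s, query, q, tau) and edit_distance(s, query) <= tau]


if __name__ == "__main__":
    db = ["similarity", "simularity", "singularity", "database"]
    print(search(db, "similarity", q=2, tau=1))
```

In this toy run, "database" is rejected by the filter alone, while "singularity" passes the filter and is rejected only at verification. This is the usual behavior of signature-based filters: they may admit false positives, but they must not introduce false negatives.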