Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold τ. Most existing methods for answering edit similarity queries generate string subsequences as signatures and produce candidates through set overlap queries on the query and data signatures. In this article, we show that for any such signature scheme, the lower bound on the minimum number of signatures is τ + 1, which is lower than what existing methods achieve. We then propose several asymmetric signature schemes, that is, schemes that extract different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve efficiency. We then generalize the basic scheme by incorporating the novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experimental results clearly showcase the efficiency of our methods and demonstrate the space and time characteristics of our proposed algorithms.
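
To make the idea of matching q-chunks against q-grams concrete, the following is a minimal Python sketch of a filter-and-verify search. It is an illustration under stated assumptions, not the IndexGram or IndexChunk algorithm: it assumes q-chunks are the disjoint length-q substrings of the data string starting at positions 0, q, 2q, ... (an incomplete trailing chunk is simply dropped), that q-grams are all overlapping length-q substrings of the query, and it scans the database linearly instead of using an index. The filter relies on the observation that one edit operation can alter at most one disjoint chunk, so a data string with k chunks that lies within edit distance τ of the query must have at least k − τ chunks appearing among the query's q-grams. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def q_chunks(s: str, q: int) -> list[str]:
    """Disjoint length-q substrings at positions 0, q, 2q, ... (assumption:
    an incomplete trailing chunk is dropped rather than padded)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def q_grams(s: str, q: int) -> set[str]:
    """All overlapping length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def passes_count_filter(data: str, query: str, q: int, tau: int) -> bool:
    """Necessary condition for edit_distance(data, query) <= tau:
    at most tau of the data string's disjoint chunks may be destroyed."""
    chunks = q_chunks(data, q)
    grams = q_grams(query, q)
    matched = sum(1 for c in chunks if c in grams)
    return matched >= len(chunks) - tau


def edit_similarity_search(database, query, q, tau):
    """Filter candidates by chunk/gram overlap, then verify exactly."""
    return [s for s in database
            if passes_count_filter(s, query, q, tau)
            and edit_distance(s, query) <= tau]


if __name__ == "__main__":
    db = ["similarity", "similarly", "simulation", "dissimilar", "signature"]
    print(edit_similarity_search(db, "similarty", q=2, tau=1))
```

Because the filter is only a necessary condition, every surviving candidate is still verified with the exact edit distance, so the result is the same as a brute-force scan; the filter merely avoids some of the expensive verifications.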