Strings are ubiquitous in computer systems, and string processing has therefore attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings under a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice across a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored to specific query types only, and their performance remains unsatisfactory, especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree-based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the properties that a mapping from the string space to the integer space must have in order to support searching and pruning for these queries. We propose three transformations that capture different aspects of the information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods for string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
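For readers unfamiliar with the two measures named above, the following is a minimal Python sketch of edit distance computed with the standard dynamic program. It is not the Bed-tree itself and is not taken from the paper. The function names are hypothetical, and the normalization used here (dividing by the length of the longer string) is an assumption about one common definition of normalized edit distance, not necessarily the one the authors use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string on the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer string's length.
    ASSUMPTION: one common normalization, chosen for illustration."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest
```

Computing this for every pair of strings costs time proportional to the product of their lengths, which is why an index that prunes candidates before the distance is evaluated matters for large collections.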