String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy making the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
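To make the search problem concrete, the following is a minimal sketch of edit-distance search over a trie. It is not the SSI method or its state labeling strategy: it uses the classic technique of carrying one dynamic-programming row per trie node and pruning any branch whose row minimum exceeds the threshold. The names `TrieNode`, `build_trie`, `search`, and the threshold parameter `k` are illustrative assumptions, not taken from the paper.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child node
        self.word = None     # set to the stored string that ends here, if any


def build_trie(strings):
    """Insert every string into a prefix tree and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root, query, k):
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results = []
    # Row for the empty prefix: turning "" into query[:j] costs j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the prefix by one character and compute the next DP row.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,      # insertion
                           prev[j] + 1,         # deletion
                           prev[j - 1] + cost)) # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Row minima never decrease along a path, so a branch whose
        # minimum already exceeds k cannot produce a match.
        if min(row) <= k:
            stack.extend((c, cch, row) for cch, c in node.children.items())
    return sorted(results)
```

For example, with the names `maria`, `mario`, `marie`, `martin`, and `anna` indexed, `search(root, "maria", 1)` returns `maria` at distance 0 and `marie` and `mario` at distance 1. Sharing prefixes in the trie means each DP row is computed once per prefix rather than once per stored string, which is the property that trie-based indexes for this problem build on.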