Efficient similarity search in very large string sets

Authors:
Dandy Fenz; Dustin Lange; Astrid Rheinländer; Felix Naumann; Ulf Leser
Affiliations:
Hasso Plattner Institute, Potsdam, Germany; Hasso Plattner Institute, Potsdam, Germany; Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany; Hasso Plattner Institute, Potsdam, Germany; Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
Venue:
SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Year:
2012

Abstract

String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy making the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
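
The search problem stated in the abstract can be illustrated with a small sketch. The code below is not the State Set Index, whose state labeling strategy the abstract describes only at a high level. It is the textbook baseline of a trie traversal that carries one row of the edit-distance dynamic programming table per node and prunes any subtree whose row already exceeds the threshold. The names `TrieNode`, `build_trie`, `similarity_search`, and the threshold parameter `k` are assumptions made for illustration.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of a prefix tree over the indexed strings."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set only where a stored string ends


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a trie and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def similarity_search(root: TrieNode, query: str, k: int) -> List[str]:
    """Return all stored strings within edit distance k of query (sorted)."""
    results: List[str] = []
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append(root.word)

    # Depth-first traversal; each stack entry carries the parent's DP row.
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,       # insertion
                           prev[j] + 1,          # deletion
                           prev[j - 1] + cost))  # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append(node.word)
        # Prune: if every entry exceeds k, no extension of this prefix can match.
        if min(row) <= k:
            stack.extend((c, x, row) for x, c in node.children.items())
    return sorted(results)


names = ["anna", "anne", "hanna", "johanna", "peter"]
index = build_trie(names)
print(similarity_search(index, "anna", 1))
```

With these five hypothetical names, the query "anna" and threshold 1 return "anna", "anne", and "hanna". The pruning step is what makes a prefix index attractive for this problem: strings sharing a prefix share the work done for that prefix, and whole subtrees are skipped once the threshold is exceeded.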