Indexing methods for approximate dictionary searching: Comparative analysis

Authors:
Leonid Boytsov
Affiliations:
North Bethesda, MD
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2011

Abstract

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article concludes with a discussion of benchmark results and directions for future research.
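To make the problem statement concrete, the following is a minimal sketch of lossless approximate dictionary search by sequential scan. It is not one of the indexing methods surveyed in the article, only a brute-force baseline of the kind such indices aim to outperform. The function names (`levenshtein`, `search`), the toy dictionary, and the use of plain Levenshtein distance (insertions, deletions, substitutions) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def search(dictionary: Dict[str, int], query: str, k: int) -> List[Tuple[str, int]]:
    """Return every (word, identifier) pair within edit distance k of query.

    The dictionary is associative: each string maps to auxiliary data,
    here a hypothetical integer identifier.
    """
    hits = []
    for word, ident in dictionary.items():
        # Length filter: strings whose lengths differ by more than k
        # cannot be within edit distance k, so skip the DP for them.
        if abs(len(word) - len(query)) > k:
            continue
        if levenshtein(word, query) <= k:
            hits.append((word, ident))
    return sorted(hits)


# Toy dictionary with made-up identifiers.
words = {"search": 1, "starch": 2, "march": 3, "index": 4}
print(search(words, "serch", 1))
print(search(words, "serch", 2))
```

Because every dictionary entry is checked, no string within distance k can be missed, which is the lossless guarantee described above. The cost grows linearly with dictionary size, which is the cost that a retrieval-optimized index is meant to avoid.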