Super-Linear indices for approximate dictionary searching

Authors:
Leonid Boytsov
Affiliations:
North Bethesda, MD
Venue:
SISAP'12 Proceedings of the 5th international conference on Similarity Search and Applications
Year:
2012

Citing 13
Cited 0

Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR)
Dictionary organizations for efficient similarity retrieval. Journal of Systems and Software
Approximate String Matching. ACM Computing Surveys (CSUR)
A hash code method for detecting and correcting spelling errors. Communications of the ACM
A technique for computer detection and correction of spelling errors. Communications of the ACM
A guided tour to approximate string matching. ACM Computing Surveys (CSUR)
Approximate String-Matching over Suffix Trees. CPM '93 Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching
Dictionary matching and indexing with errors and don't cares. STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Fast Approximate Search in Large Dictionaries. Computational Linguistics
Contextual Postprocessing System for Cooperation with a Multiple-Choice Character-Recognition System. IEEE Transactions on Computers
Faster and Space-Optimal Edit Distance "1" Dictionary. CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
Directly Addressable Variable-Length Codes. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Indexing methods for approximate dictionary searching: Comparative analysis. Journal of Experimental Algorithmics (JEA)


Abstract

We present an experimental analysis of approximate search algorithms that index deletion neighborhoods. These methods require huge indices whose sizes grow exponentially with the maximum allowable number of errors k. Despite their extraordinary space requirements, super-linear indices are of great interest because they provide some of the shortest retrieval times. A straightforward implementation that builds a hash index directly over residual strings (obtained by deleting characters from dictionary words) is not space efficient. Rather than storing complete residual strings, we record only the deleted characters and their positions. These data are indexed using a perfect hash function computed for the set of residual dictionary strings [2]. We evaluate this approach experimentally against several well-known benchmarks, including FastSS, which stores residual strings directly [3]. The experiments show that our implementation performs comparably to, or better than, the fastest benchmarks, while requiring 4 to 8 times less space than FastSS.
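The idea of indexing deletion neighborhoods can be illustrated with a small Python sketch. It is not the paper's implementation: all function names are hypothetical, and an ordinary Python dict keyed by residual strings stands in for the perfect hash function, so the sketch shows the lookup logic only and none of the space savings. Each index entry holds just the deleted characters and their positions, from which the dictionary word is rebuilt on demand. Candidates are verified with the Levenshtein distance, which is an assumption about the error model.

```python
from itertools import combinations


def deletion_neighborhood(word, k):
    """Yield (residual, deleted) for every way of deleting up to k characters.

    `deleted` is a tuple of (position, character) pairs in increasing
    position order, with positions referring to the original word.
    """
    for n in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            removed = set(positions)
            residual = "".join(c for i, c in enumerate(word) if i not in removed)
            yield residual, tuple((i, word[i]) for i in positions)


def restore(residual, deleted):
    """Rebuild the original word from a residual and its deletion record."""
    chars = list(residual)
    for pos, ch in deleted:  # increasing positions keep earlier inserts valid
        chars.insert(pos, ch)
    return "".join(chars)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(words, k):
    """Map each residual to its deletion records (a dict stands in for the
    perfect hash; a real one would not store the residual keys at all)."""
    index = {}
    for word in words:
        for residual, deleted in deletion_neighborhood(word, k):
            index.setdefault(residual, set()).add(deleted)
    return index


def search(index, query, k):
    """Return all indexed words within edit distance k of the query."""
    found = set()
    for residual, _ in deletion_neighborhood(query, k):
        for deleted in index.get(residual, ()):
            candidate = restore(residual, deleted)
            if levenshtein(candidate, query) <= k:  # filter false candidates
                found.add(candidate)
    return found


WORDS = ["search", "starch", "march", "index"]
INDEX_K1 = build_index(WORDS, 1)
INDEX_K2 = build_index(WORDS, 2)
```

Calling `search(INDEX_K1, "serch", 1)` probes the index with every string obtained from the query by at most one deletion. The number of stored records grows quickly with k, which mirrors the exponential index growth described in the abstract.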