Indexing Variable Length Substrings for Exact and Approximate Matching

Authors:
Gonzalo Navarro; Leena Salmela
Affiliations:
Department of Computer Science, University of Chile; Department of Computer Science and Engineering, Helsinki University of Technology
Venue:
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Year:
2009

Citing 6

Fast text searching: allowing errors
Communications of the ACM

Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing

Fast and practical approximate string matching
Information Processing Letters

Filtration with q-Samples in Approximate String Matching
CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching

On Using q-Gram Locations in Approximate String Matching
ESA '95 Proceedings of the Third Annual European Symposium on Algorithms

Improved Variable-to-Fixed Length Codes
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval

Cited 5

Indexing methods for approximate dictionary searching: Comparative analysis
Journal of Experimental Algorithmics (JEA)

Efficient exact edit similarity query processing with the asymmetric signature scheme
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Efficient indexing algorithms for approximate pattern matching in text
Proceedings of the Seventeenth Australasian Document Computing Symposium

Efficient fuzzy search in large text collections
ACM Transactions on Information Systems (TOIS)

Asymmetric signature schemes for efficient exact edit similarity query processing
ACM Transactions on Database Systems (TODS)

Abstract

We introduce two new index structures based on the q-gram index. The new structures index substrings of variable length instead of q-grams of fixed length. For both of the new indexes, we present a method based on the suffix tree to efficiently choose the indexed substrings so that each of them occurs almost equally frequently in the text. Our experiments show that the resulting indexes are up to 40% faster than the q-gram index when they use the same space.
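
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds a plain fixed-length q-gram index, uses it for exact matching, and then picks variable-length substrings by naively extending any substring that occurs too often. All function names and the parameters `max_freq` and `max_len` are hypothetical, and the naive extension loop is only a stand-in for the suffix-tree-based selection, which the abstract does not spell out.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def search_exact(text, index, q, pattern):
    """Find pattern occurrences by verifying the positions of its first q-gram."""
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text.startswith(pattern, i)]


def choose_variable_substrings(text, max_freq, max_len):
    """Assumed, simplified selection: extend a substring by one character
    while it occurs more than max_freq times and is shorter than max_len.
    Occurrences that cannot be extended past the end of the text are dropped."""
    chosen = {}
    frontier = defaultdict(list)
    for i, ch in enumerate(text):
        frontier[ch].append(i)
    length = 1
    while frontier:
        next_frontier = defaultdict(list)
        for sub, positions in frontier.items():
            if len(positions) <= max_freq or length == max_len:
                chosen[sub] = positions  # rare enough (or at the length cap): index it
            else:
                for i in positions:
                    if i + length < len(text):
                        next_frontier[text[i:i + length + 1]].append(i)
        frontier = next_frontier
        length += 1
    return chosen


text = "abracadabra"
index = build_qgram_index(text, 2)
print(search_exact(text, index, 2, "abra"))  # [0, 7]
print(choose_variable_substrings(text, max_freq=2, max_len=3))
```

In the fixed index, a frequent q-gram produces a long position list and many candidates to verify. In the sketch, the frequent character "a" is replaced by the longer substrings "ab", "ac" and "ad", so no position list exceeds `max_freq` entries unless the length cap is reached.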