Compression, indexing, and retrieval for massive string data
CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Cache-oblivious index for approximate string matching
Theoretical Computer Science
A linear size index for approximate pattern matching
Journal of Discrete Algorithms
Hi-index | 0.00 |
Approximate string matching is about finding a given string pattern in a text by allowing some degree of errors. In this paper we present a space efficient data structure to solve the 1-mismatch and 1-difference problems. Given a text T of length n over an alphabet A, we can preprocess T and give an $O(n\sqrt{\log n}\log |A|)$-bit space data structure so that, for any query pattern P of length m, we can find all 1-mismatch (or 1-difference) occurrences of P in O(|A|mlog log n+occ) time, where occ is the number of occurrences. This is the fastest known query time given that the space of the data structure is o(nlog 2 n) bits. The space of our data structure can be further reduced to O(nlog |A|) with the query time increasing by a factor of log ε n, for 0ε≤1. Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in O(|A| k m k (k+log log n)+occ) and O(log ε n(|A| k m k (k+log log n)+occ)) time using an $O(n\sqrt{\log n}\log |A|)$-bit and an O(nlog |A|)-bit indexing data structures, respectively. We assume that the alphabet size |A| is bounded by $O(2^{\sqrt{\log n}})$ for the $O(n\sqrt{\log n}\log |A|)$-bit space data structure.