Improved Approximate String Matching Using Compressed Suffix Data Structures

Authors:
Tak-Wah Lam;Wing-Kin Sung;Swee-Seong Wong
Affiliations:
The University of Hong Kong, Department of Computer Science, Hong Kong, Hong Kong;National University of Singapore, School of Computing, Singapore, Singapore;National University of Singapore, School of Computing, Singapore, Singapore
Venue:
Algorithmica
Year:
2008

Abstract

Approximate string matching is the problem of finding occurrences of a given pattern in a text while allowing a limited number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text $T$ of length $n$ over an alphabet $A$, we can preprocess $T$ into an $O(n\sqrt{\log n}\log |A|)$-bit data structure so that, for any query pattern $P$ of length $m$, all 1-mismatch (or 1-difference) occurrences of $P$ can be found in $O(|A| m \log\log n + occ)$ time, where $occ$ is the number of occurrences. This is the fastest known query time among data structures using $o(n\log^2 n)$ bits of space. The space can be further reduced to $O(n\log |A|)$ bits, with the query time increasing by a factor of $\log^{\epsilon} n$, for $0 < \epsilon \le 1$. Furthermore, our solution generalizes to the $k$-mismatch (and $k$-difference) problem, which it solves in $O(|A|^k m^k (k+\log\log n)+occ)$ time using an $O(n\sqrt{\log n}\log |A|)$-bit index, and in $O(\log^{\epsilon} n\,(|A|^k m^k (k+\log\log n)+occ))$ time using an $O(n\log |A|)$-bit index. For the $O(n\sqrt{\log n}\log |A|)$-bit data structure, we assume that the alphabet size $|A|$ is bounded by $O(2^{\sqrt{\log n}})$.
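
To make the 1-mismatch problem concrete, the following is a minimal Python sketch of a naive baseline, not the compressed data structure described in the abstract. It assumes a plain, uncompressed suffix array and simply runs an exact search for the pattern and for each of its single-character substitutions, which is one way to see where an $|A| m$ factor in the query time can arise. Each search here costs $O(m \log n)$, so the sketch does not achieve the bounds stated above. The function names (`build_suffix_array`, `suffix_range`, `one_mismatch_occurrences`) are hypothetical and chosen only for illustration.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; adequate for small inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text, sa, pattern):
    # Return the half-open range [start, end) of suffix-array entries whose
    # suffixes begin with `pattern`, found with two binary searches.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def one_mismatch_occurrences(text, sa, pattern, alphabet):
    # Collect start positions where `pattern` matches with at most one
    # substituted character (Hamming distance <= 1).
    hits = set()
    start, end = suffix_range(text, sa, pattern)
    hits.update(sa[start:end])  # exact occurrences
    for i, original in enumerate(pattern):
        for c in alphabet:
            if c == original:
                continue
            # Substitute position i and search for the variant exactly.
            variant = pattern[:i] + c + pattern[i + 1:]
            start, end = suffix_range(text, sa, variant)
            hits.update(sa[start:end])
    return sorted(hits)


text = "banana"
sa = build_suffix_array(text)
print(one_mismatch_occurrences(text, sa, "nan", "abn"))  # [0, 2]
```

In this example, position 2 is an exact occurrence of "nan" and position 0 ("ban") differs from it in one character. The sketch covers substitutions only; the 1-difference problem additionally allows a single insertion or deletion.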