Improving the bit-parallel NFA of Baeza-Yates and Navarro for approximate string matching

Authors:
Heikki Hyyrö
Affiliations:
Department of Computer Sciences, University of Tampere, Finland
Venue:
Information Processing Letters
Year:
2008

Citing 9
Cited 0

Fast text searching: allowing errors. Communications of the ACM.
Text algorithms.
A subquadratic algorithm for approximate regular expression matching. Journal of Algorithms.
A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM).
A technique for computer detection and correction of spelling errors. Communications of the ACM.
A guided tour to approximate string matching. ACM Computing Surveys (CSUR).
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences.
Tighter packed bit-parallel NFA for approximate string matching. CIAA'06 Proceedings of the 11th international conference on Implementation and Application of Automata.
New bit-parallel indel-distance algorithm. WEA'05 Proceedings of the 4th international conference on Experimental and Efficient Algorithms.


Abstract

We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms for moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) (1999) 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a pattern of length m and an error threshold k, the original BPD requires (m-k)(k+2) bits of space to represent an NFA with (m-k)(k+1) states. In this paper we remove this redundancy from the NFA representation. Our variant requires (m-k)(k+1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space saving is achieved with an alternative simulation algorithm for the bit-parallel NFA that is at least as efficient as the original. We also present experimental results comparing our modified NFA against the original BPD and its main competitors. The new variant is more efficient than the original BPD, and hence it takes over and extends the original's role as one of the most practical approximate string matching algorithms for moderate values of k and m.
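
To make the space bounds in the abstract concrete, the following minimal Python sketch evaluates both formulas for given m and k. The function names and the helper `max_pattern_length` are hypothetical, and the single-word setting it models (the whole NFA fitting into one w-bit machine word) is an assumption made here for illustration; the abstract states only the bit counts.

```python
def bpd_bits(m: int, k: int) -> int:
    """Bits used by the original BPD representation: (m-k)(k+2)."""
    return (m - k) * (k + 2)


def compact_bits(m: int, k: int) -> int:
    """Bits used by the one-bit-per-state variant: (m-k)(k+1)."""
    return (m - k) * (k + 1)


def max_pattern_length(k: int, w: int, extra: int) -> int:
    """Largest m with (m-k)(k+extra) <= w.

    extra=2 models the original BPD, extra=1 the compact variant.
    Assumption (not from the abstract): the NFA must fit in one w-bit word.
    """
    return k + w // (k + extra)


if __name__ == "__main__":
    m, k, w = 20, 3, 64
    print("original BPD bits:", bpd_bits(m, k))
    print("compact variant bits:", compact_bits(m, k))
    print("max m in one word (original):", max_pattern_length(k, w, 2))
    print("max m in one word (compact):", max_pattern_length(k, w, 1))
```

Under this single-word assumption, saving m-k bits means that slightly longer patterns fit at the same error threshold k.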