Fast and simple character classes and bounded gaps pattern matching, with application to protein searching

Authors:
Gonzalo Navarro;Mathieu Raffinot
Affiliations:
Dept. of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;Equipe génome, cellule et informatique, Université de Versailles, 45 avenue des Etats-Unis, 78035 Versailles Cedex
Venue:
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
Year:
2001

Abstract

The problem of quickly searching a text for a pattern that contains Classes of characters and Bounded size Gaps (CBG) has a wide range of applications, among which a very important one is protein pattern matching. For instance, one PROSITE protein site is associated with the CBG [RK]-x(2,3)-[DE]-x(2,3)-Y, where the brackets match any of the letters inside and x(2,3) matches a gap of length between 2 and 3. Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. For this reason, we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters and hence is usually faster than the first one, but in bad cases it may have to read the same text character more than once. We then propose a criterion, based on the form of the CBG, to choose a priori the faster of the two. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
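
To make the CBG notation concrete, the following is a minimal Python sketch of a forward scan that reads each text character once. It is not the authors' algorithm as published: it is a generic bit-parallel simulation of a nondeterministic automaton, written under the assumption that a gap x(a,b) can be expanded into a mandatory wildcard positions followed by b-a optional ones. The names `parse_cbg` and `search_cbg` are hypothetical, and the closure step is a plain loop rather than a constant-time bit trick.

```python
import re


def parse_cbg(pattern):
    """Expand a PROSITE-style CBG into (char_set_or_None, optional) positions.

    None stands for a wildcard position; optional positions may be skipped.
    """
    positions = []
    for elem in pattern.replace(" ", "").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", elem)
        if gap:
            lo = int(gap.group(1) or 1)   # bare "x" means exactly one wildcard
            hi = int(gap.group(2) or lo)  # "x(n)" means exactly n wildcards
            positions += [(None, False)] * lo + [(None, True)] * (hi - lo)
        else:
            # "[RK]" becomes {"R", "K"}; a single letter "Y" becomes {"Y"}
            positions.append((frozenset(elem.strip("[]")), False))
    return positions


def search_cbg(pattern, text):
    """Return the end index of every occurrence of the CBG in text."""
    positions = parse_cbg(pattern)
    m = len(positions)

    # Bit j of a mask means "position j accepts this character".
    wild = 0
    for j, (chars, _) in enumerate(positions):
        if chars is None:
            wild |= 1 << j
    table = {}
    for j, (chars, _) in enumerate(positions):
        for c in chars or ():
            table[c] = table.get(c, wild) | (1 << j)
    optional = [j for j, (_, opt) in enumerate(positions) if opt]

    def closure(d):
        # An active state j may skip an optional position j and reach j + 1.
        # Ascending order lets runs of optional positions propagate.
        for j in optional:
            if (d >> j) & 1:
                d |= 1 << (j + 1)
        return d

    d = closure(1)  # bit 0 is the initial state, re-activated at every step
    ends = []
    for i, c in enumerate(text):
        d = closure(((d & table.get(c, wild)) << 1) | 1)
        if (d >> m) & 1:  # bit m set: all positions matched, ending at i
            ends.append(i)
    return ends


print(search_cbg("[RK]-x(2,3)-[DE]-x(2,3)-Y", "AAKLMDQRSYAA"))  # prints [9]
```

In the usage line, the occurrence starts at K (index 2), spans the gap LM, matches D, spans the gap QRS, and ends at Y (index 9). Each text character triggers one table lookup and one state update, which mirrors the "looks exactly once at each text character" property only in spirit.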