Algorithms on strings, trees, and sequences: computer science and computational biology
Verifying candidate matches in sparse and wildcard matching
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Introduction to Automata Theory, Languages, and Computation (3rd Edition)
Comparing the size of NFAs with and without ε-transitions
Theoretical Computer Science
Regular expression constrained sequence alignment
Journal of Discrete Algorithms
SA-REPC: sequence alignment with regular expression path constraint
LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications
We define a novel variation on the constrained sequence alignment problem in which the constraint is given in the form of a regular expression. Given two sequences, an alphabet $\Gamma$ describing pairwise sequence alignment operations, and a regular expression $R$ over $\Gamma$, the problem is to compute the highest-scoring sequence alignment $A$ of the given sequences such that $A \in \Gamma^* L(R) \Gamma^*$. Two algorithms are given for solving this problem. The first, basic algorithm is general and solves the problem in $O(nmr \log^2 r)$ time and $O(\min\{n,m\}\, r)$ space, where $m$ and $n$ are the lengths of the two sequences and $r$ is the size of the NFA for $R$. The second algorithm is restricted to rigid patterns and exploits this restriction to reduce the NFA size factor $r$ in the time complexity to a smaller factor corresponding to the length of the rigid pattern. A rigid pattern $P$ is a regular expression of the form $P = P_1 \cup \dots \cup P_k$, where each $P_i$ contains neither the Kleene-closure star nor union. $|P|$ is compacted by supporting alignment patterns extended by meta-characters including general insertion, deletion and match operations, as well as some cases of substitutions. For a pattern $P \in (\Gamma \cup \{m,i\})^*$ or $P \in (\Gamma \cup \{m,d\})^*$, the problem can be solved in time $O(nm)$, while for a pattern $P \in (\Gamma \cup \{m,i,d\})^*$, the problem can be solved in time $O(nm \log |P|)$. For a pattern $P \in (\Gamma \cup \{m,s,i,d\})^*$, the problem can be solved in time $O(nm \log |P|)$ in some cases: one case is for scoring functions $\mathrm{Score}$ for which there exists $\mathrm{Score}' : \Sigma \to \mathbb{R}$ such that $\mathrm{Score}(\nu,\sigma) = \mathrm{Score}'(\nu) + \mathrm{Score}'(\sigma)$ for every $\nu, \sigma$, and the other is when $\mathrm{occ}_s(P) = O(\log(\max\{n,m\}))$. For a rigid pattern $P = P_1 \cup \dots \cup P_k$, these time bounds range from $O(knm)$ to $O(knm \log(\max\{|P_i|\}))$, depending on the meta-characters used in $P$.
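To make the problem statement concrete, here is a minimal Python sketch of the straightforward product dynamic program over (position in first sequence, position in second sequence, automaton state). It is not either of the two algorithms described above, and it runs in time proportional to n·m·|P|. Everything in it is an illustrative assumption: the function name, the unit scores, global alignment, and the reading of the meta-characters, where `m` is a column of equal letters, `s` a column of distinct letters, `d` consumes a letter of the first sequence only, and `i` consumes a letter of the second sequence only. The pattern is a plain string over these four meta-characters that must occur as a contiguous block of the alignment's operation string.

```python
NEG_INF = float("-inf")

def constrained_alignment_score(a, b, pattern, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b whose operation string
    contains `pattern` (over 'm', 's', 'i', 'd') as a contiguous block.
    Returns -inf when no alignment satisfies the constraint."""
    n, m, p = len(a), len(b), len(pattern)
    # best[i][j][k]: best score for aligning a[:i] with b[:j] in state k.
    # State 0 is the free prefix, state k (0 < k < p) means the last k
    # operations spell pattern[:k], and state p is the free suffix.
    best = [[[NEG_INF] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    best[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            # Alignment operations available from cell (i, j).
            ops = []
            if i < n and j < m:
                if a[i] == b[j]:
                    ops.append(("m", i + 1, j + 1, match))
                else:
                    ops.append(("s", i + 1, j + 1, mismatch))
            if i < n:
                ops.append(("d", i + 1, j, gap))
            if j < m:
                ops.append(("i", i, j + 1, gap))
            for k in range(p + 1):
                cur = best[i][j][k]
                if cur == NEG_INF:
                    continue
                for op, ni, nj, score in ops:
                    targets = []
                    if k == 0 or k == p:
                        targets.append(k)      # self-loop on any operation
                    if k < p and pattern[k] == op:
                        targets.append(k + 1)  # advance inside the pattern
                    for nk in targets:
                        if cur + score > best[ni][nj][nk]:
                            best[ni][nj][nk] = cur + score
    return best[n][m][p]
```

For example, `constrained_alignment_score("ACGT", "ACGT", "d")` forces at least one deletion into the alignment of two identical sequences, so the best score drops below the unconstrained optimum.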
An additional result obtained along the way is an extension of the algorithm of Fischer and Paterson for String Matching with Wildcards. Our extension allows the input strings to include "negation symbols" (that match all letters but a specific one) while retaining the original time complexity. We implemented both algorithms and applied them to data-mine new miRNA seeding patterns in C. elegans CLIP-seq experimental data.
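As an illustration of what matching with wildcards and negation symbols means, the sketch below counts mismatches with one FFT-based correlation per alphabet letter. This is a simple per-letter variant in the spirit of convolution-based matching. It is neither the original Fischer-Paterson encoding nor the extension described above, and it does not achieve their time bound. The token conventions are assumptions made for the example: `'?'` is a wildcard, a tuple `('not', c)` is a negation symbol, and negation symbols appear only in the pattern.

```python
import cmath

def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def correlate(t, r):
    """c[s] = sum_j t[s + j] * r[j] for 0 <= s <= len(t) - len(r)."""
    n, m = len(t), len(r)
    size = 1
    while size < n + m - 1:
        size *= 2
    ft = fft(t + [0] * (size - n))
    fr = fft(r[::-1] + [0] * (size - m))  # reversing turns convolution into correlation
    prod = fft([x * y for x, y in zip(ft, fr)], invert=True)
    return [round(prod[s + m - 1].real / size) for s in range(n - m + 1)]

def rejects(token, letter):
    """True if the pattern token cannot be aligned with `letter`."""
    if token == "?":
        return False                 # wildcard accepts every letter
    if isinstance(token, tuple):
        return token[1] == letter    # ('not', c) rejects only c
    return token != letter           # plain letter rejects all others

def wildcard_match(text, pattern, alphabet):
    """Start positions where `pattern` (a list of tokens) matches `text`."""
    n, m = len(text), len(pattern)
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= len(pattern) <= len(text)")
    mismatches = [0] * (n - m + 1)
    for letter in alphabet:
        t = [1 if ch == letter else 0 for ch in text]  # a '?' in text is all zeros
        r = [1 if rejects(tok, letter) else 0 for tok in pattern]
        for s, count in enumerate(correlate(t, r)):
            mismatches[s] += count
    return [s for s, count in enumerate(mismatches) if count == 0]
```

A call such as `wildcard_match("ACGTACTA", ["A", "C", ("not", "G")], "ACGT")` reports the start positions where `A`, `C`, and then any letter other than `G` occur consecutively.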