An efficient algorithm for finding long conserved regions between genes

Authors:
Tak-Man Ma;Yuh-Dauh Lyuu;Yen-Wu Ti
Affiliations:
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia;Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan;Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Venue:
CompLife'06: Proceedings of the Second International Conference on Computational Life Sciences
Year:
2006



Abstract

We study the problem of extracting approximate non-tandem repeats (conserved regions) from strings (genes). Given a string S over a finite alphabet and thresholds L and D, extracting approximate repeats means finding pairs (β, β′) of substrings of S, subject to some constraints, such that β and β′ have edit distance at most D and each has length at least L. Previous work mainly addresses the case where D is small, so it is not well suited to extracting approximate repeats with relatively large D. In contrast, this paper focuses on extracting long approximate repeats with large D, and our algorithm is more efficient than previous approaches. We also show that our algorithm is time-optimal when D is a constant. Specifically, given an input string S and thresholds L and D, we extract all (D, L)-supermaximal approximate repeats (β, β′) of S. One useful application is finding all longest possible substrings β of S for which there exists some other substring β′ of S such that β and β′ have edit distance at most D and each has length at least L. The algorithm extends easily to multiple input strings $S_1, S_2, \ldots, S_n$ by first concatenating them into one long subject string S, separated by a special symbol $\sharp$: $S_1\sharp S_2\sharp\ldots\sharp S_n$. The running time of our algorithm is $O(DN^2)$, where $N=|S_1|+|S_2|+\cdots+|S_n|$.
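
The pair condition and the concatenation step described in the abstract can be illustrated with a short sketch. This is not the paper's algorithm: it only checks whether a given pair of substrings meets the length and edit-distance thresholds, using the textbook dynamic program for edit distance, and it does not model supermaximality or the unspecified extra constraints. The function names and the use of `#` as the separator symbol are assumptions made for this example.

```python
from typing import List

# Assumption: "#" stands in for the special separator symbol and
# does not occur in any input string.
SEPARATOR = "#"


def edit_distance(a: str, b: str) -> int:
    """Edit (Levenshtein) distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def is_approximate_repeat(beta: str, beta_prime: str, L: int, D: int) -> bool:
    """True if both substrings have length >= L and edit distance <= D."""
    return (len(beta) >= L
            and len(beta_prime) >= L
            and edit_distance(beta, beta_prime) <= D)


def concatenate(strings: List[str]) -> str:
    """Join the input strings into one subject string S with the separator."""
    for s in strings:
        if SEPARATOR in s:
            raise ValueError("separator symbol must not occur in the inputs")
    return SEPARATOR.join(strings)


if __name__ == "__main__":
    genes = ["ACGTACGT", "ACGTTCGT"]
    subject = concatenate(genes)
    total_length = sum(len(g) for g in genes)  # N in the abstract
    print(subject, total_length)
    print(is_approximate_repeat(genes[0], genes[1], L=8, D=1))
```

Here the two hypothetical gene strings differ by a single substitution, so they satisfy the condition for L = 8 and D = 1. Checking one pair this way costs time proportional to the product of the two lengths, which is why enumerating all substring pairs naively would be impractical on long inputs.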