Finding motifs in biological sequences is one of the most intriguing problems for string algorithm designers, both because of its numerous applications in molecular biology and because of the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion), and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely, exponential in the approximation degree). This often makes the output unreadable and also slows down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has repeatedly been observed that only a subset of these motifs, namely the maximal motifs, can suffice to convey the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotic complexity.
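To make the notion of approximate occurrence under Hamming distance concrete, here is a minimal Python sketch. It is a naive illustration only, not the suffix-tree method of the paper: the function names (`hamming`, `occurrences`, `neighborhood_size`), the example sequence, and the DNA alphabet size of 4 are all assumptions chosen for the example. The last function counts how many words lie within a given number of mismatches of a fixed word, which suggests why the number of candidate motifs grows quickly with the approximation degree.

```python
from math import comb


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def occurrences(motif, text, e):
    """Start positions where `motif` matches `text` with at most `e` mismatches.

    Naive window scan (hypothetical helper); motif extraction tools of the
    kind the abstract refers to use a suffix tree instead.
    """
    k = len(motif)
    return [i for i in range(len(text) - k + 1)
            if hamming(motif, text[i:i + k]) <= e]


def neighborhood_size(k, e, sigma=4):
    """Count of length-k words within Hamming distance e of a fixed word.

    `sigma` is the alphabet size (4 is assumed here for DNA).
    """
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(min(e, k) + 1))


if __name__ == "__main__":
    text = "ACGTACCTAGGT"  # made-up example sequence
    print(occurrences("ACGT", text, 0))
    print(occurrences("ACGT", text, 1))
    print([neighborhood_size(4, e) for e in range(3)])
```

Allowing a single mismatch already adds occurrences that an exact search misses, and each extra allowed mismatch multiplies the number of words that count as similar to a given one.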