Suffix tree characterization of maximal motifs in biological sequences

Authors:
Maria Federico; Nadia Pisanti
Affiliations:
Dipartimento di Ingegneria dell'Informazione, Università di Modena e Reggio Emilia, Via Vignolese 905, 41100 Modena, Italy; Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
Venue:
Theoretical Computer Science
Year:
2009

Abstract

Finding motifs in biological sequences is one of the most intriguing problems for string algorithm designers due to, on the one hand, the numerous applications of this problem in molecular biology and, on the other hand, the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion), and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely exponential in the approximation degree). This often makes the output unreadable and slows down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has been observed many times that only a subset of these motifs, namely the maximal motifs, could be enough to provide the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotic complexity.
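
To make the notions of approximate occurrence and output blow-up in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-tree method of the paper. The function names (`hamming`, `occurrences`, `neighborhood_size`, `frequent_motifs`), the DNA alphabet, the quorum parameter `q`, and the toy input are all assumptions introduced here for illustration. It shows how allowing `e` mismatches under Hamming distance enlarges the set of reported motifs, and how several motifs can share exactly the same occurrences.

```python
from itertools import product
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def occurrences(text, motif, e):
    """Start positions where `motif` matches `text` with at most e mismatches."""
    k = len(motif)
    return [i for i in range(len(text) - k + 1)
            if hamming(text[i:i + k], motif) <= e]


def neighborhood_size(k, e, sigma=4):
    """How many length-k strings lie within Hamming distance e of a fixed one.

    Choose i mismatching positions, then one of (sigma - 1) other letters
    for each; the count grows exponentially with e.
    """
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(min(e, k) + 1))


def frequent_motifs(text, k, e, q, alphabet="ACGT"):
    """Brute force (hypothetical helper): every length-k string over
    `alphabet` with at least q approximate occurrences in `text`.
    Exponential in k, so only usable on toy inputs."""
    found = {}
    for letters in product(alphabet, repeat=k):
        motif = "".join(letters)
        occ = occurrences(text, motif, e)
        if len(occ) >= q:
            found[motif] = occ
    return found


if __name__ == "__main__":
    text = "ACGTACGA"  # toy sequence, assumed for illustration
    print(occurrences(text, "ACGT", 1))
    print([neighborhood_size(4, e) for e in range(3)])
    print(frequent_motifs(text, k=4, e=1, q=2))
```

On this toy input no 4-letter motif occurs twice exactly, while with one mismatch allowed four motifs reach the quorum and all report the same two positions. Two of them never appear verbatim in the text. This is one small instance of the redundancy that a maximality requirement is meant to reduce; the precise definitions of maximality for approximate motifs are those given in the paper and are not reproduced in this sketch.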