Frequent patterns (motifs) in biological sequences are good candidates for structurally or functionally important elements. The typical output of existing tools for the exhaustive detection of approximated motifs is a long list of motifs containing some real motifs (i.e., patterns representing functional elements) along with a large number of random variations of them, called artifacts. Artifacts increase the output size, often producing results that are redundant and hard for biologists to use. In this paper, we provide a new solution to the problem of separating real motifs from artifacts. We define a notion of motif maximality, called maximality in conservation, which, when applied to the output of existing motif-finding tools, allows us to identify and remove artifacts. Artifact detection relies on the observation that variations of a motif share a large subset of the occurrences of the real motif, while the real motif is more conserved than any of its artifacts. Experiments show that the tool we implemented according to this definition substantially reduces the output size by removing artifacts, at a negligible time cost.
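The filtering idea in the abstract can be illustrated with a small sketch. This is not the paper's definition of maximality in conservation, which the abstract does not spell out. It only assumes that each motif comes with a non-empty set of occurrence positions and a mismatch count, that "conservation" can be approximated by mismatches per occurrence, and that "a large subset of occurrences" can be expressed as a threshold. The names `Motif`, `is_artifact_of`, `remove_artifacts`, and `min_shared` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    pattern: str
    occurrences: frozenset  # start positions in the sequence (assumed non-empty)
    mismatches: int         # total mismatches over all occurrences (assumed score)

    @property
    def conservation(self) -> float:
        # Assumption: fewer mismatches per occurrence means more conserved.
        return -self.mismatches / len(self.occurrences)


def is_artifact_of(candidate, reference, min_shared=0.8):
    """True if candidate shares most of its occurrences with reference
    and reference is strictly more conserved (illustrative criterion)."""
    shared = len(candidate.occurrences & reference.occurrences)
    return (
        shared / len(candidate.occurrences) >= min_shared
        and reference.conservation > candidate.conservation
    )


def remove_artifacts(motifs, min_shared=0.8):
    """Keep only motifs that are not artifacts of any other motif in the list."""
    return [
        m
        for m in motifs
        if not any(
            is_artifact_of(m, other, min_shared)
            for other in motifs
            if other is not m
        )
    ]


# Toy input: a well-conserved motif, a noisier variation of it, and an unrelated motif.
real = Motif("ACGT", frozenset({0, 10, 20, 30}), mismatches=1)
variant = Motif("ACGA", frozenset({0, 10, 20, 30, 40}), mismatches=5)
unrelated = Motif("TTGA", frozenset({5, 15}), mismatches=0)

kept = remove_artifacts([real, variant, unrelated])
```

In this toy input, `variant` shares four of its five occurrences with `real` and is less conserved, so it is dropped, while `real` and `unrelated` are kept. The pairwise comparison is quadratic in the number of motifs; it is meant only to make the criterion concrete, not to reflect the cost of the authors' tool.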