Bases of Motifs for Generating Repeated Patterns with Wild Cards

Authors:
Nadia Pisanti;Maxime Crochemore;Roberto Grossi;Marie-France Sagot
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2005

Abstract

Motif inference is one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempts to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the others can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on this cost, introducing a notion of basis that is provably contained in (and, thus, smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of efficiently computing such bases unless the quorum is fixed.
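To make the terms "motif with wild cards" and "quorum" concrete, the following minimal Python sketch counts the occurrences of a pattern in a sequence and checks them against a quorum. It illustrates the definitions only and is not the basis construction studied in the paper. The `.` wild-card symbol and the names `occurrences` and `meets_quorum` are assumptions introduced here for illustration.

```python
from typing import List

# Assumed don't-care symbol; the abstract does not fix a notation.
WILDCARD = "."


def occurrences(motif: str, text: str) -> List[int]:
    """Return every start position at which `motif` matches `text`.

    A wild card in the motif matches any single letter of the text;
    every other motif letter must match exactly.
    """
    m = len(motif)
    positions = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if all(p == WILDCARD or p == c for p, c in zip(motif, window)):
            positions.append(start)
    return positions


def meets_quorum(motif: str, text: str, quorum: int) -> bool:
    """True if `motif` occurs at least `quorum` times in `text`."""
    return len(occurrences(motif, text)) >= quorum


if __name__ == "__main__":
    seq = "ABCADCAEB"
    print(occurrences("A.C", seq))      # start positions of A.C in seq
    print(meets_quorum("A.C", seq, 2))  # repeated at least twice?
```

This naive scan takes time proportional to the product of the motif and sequence lengths for one motif. The difficulty the abstract refers to is that the number of candidate motifs meeting the quorum can explode, which is what a basis is meant to contain.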