Discrete Applied Mathematics
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the number of candidate motifs that include wild cards or “don't cares” grows exponentially with the number of such characters, and this only gets worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency affords significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
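To make the notion of a don't care that "stretches up to some prescribed maximum length" concrete, the following is a minimal Python sketch of how occurrences of such a motif might be located and counted. It is not the Varun implementation and does not perform the saturation or scoring steps described above. The token encoding (a solid character, or a `(min_gap, max_gap)` tuple for a stretchable don't care) and the function names `motif_to_regex` and `occurrence_starts` are assumptions made purely for illustration.

```python
import re


def motif_to_regex(tokens):
    """Compile a motif into a regex (hypothetical encoding, for illustration).

    Each token is either a single solid character, or a (min_gap, max_gap)
    tuple standing for a run of don't cares of variable length.
    """
    parts = []
    for tok in tokens:
        if isinstance(tok, tuple):
            lo, hi = tok
            parts.append(".{%d,%d}" % (lo, hi))  # stretchable don't care
        else:
            parts.append(re.escape(tok))  # solid character
    # A lookahead is used so that overlapping occurrences are all reported.
    return re.compile("(?=(" + "".join(parts) + "))")


def occurrence_starts(tokens, sequence):
    """Return the start positions at which the motif occurs in the sequence."""
    pattern = motif_to_regex(tokens)
    return [m.start() for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    seq = "AXCAXXXCAC"
    rigid = ["A", (1, 1), "C"]        # exactly one don't care
    extensible = ["A", (1, 3), "C"]   # one to three don't cares
    print(occurrence_starts(rigid, seq))
    print(occurrence_starts(extensible, seq))
```

In this toy sequence the rigid motif occurs at one position, while the extensible variant occurs at two. This hints at why allowing gaps to stretch enlarges both the occurrence lists and the space of candidate motifs that a discovery method has to prune.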