VARUN: Discovering Extensible Motifs under Saturation Constraints

Authors:
Alberto Apostolico;Matteo Comin;Laxmi Parida
Affiliations:
Georgia Institute of Technology, Atlanta and University of Padova, Padova;University of Padova, Padova;IBM T.J. Watson Research Center, Yorktown Heights
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2010

Abstract

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the number of candidate motifs that include wild cards or “don't cares” grows exponentially with the number of such positions, and the problem only gets worse if a don't care is allowed to stretch up to some prescribed maximum length. This paper introduces and studies a notion of extensible motif in a sequence that tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency affords significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
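
To make the idea of a stretchable don't care concrete, the following is a minimal sketch and not the Varun implementation. It assumes a hypothetical representation in which a motif is a list whose items are either a solid character or a `(lo, hi)` pair meaning "between lo and hi don't cares"; the function names are likewise illustrative. It counts the start positions at which such a motif occurs, and shows how many rigid patterns a single extensible motif stands for.

```python
import re


def motif_to_regex(elements):
    """Translate a motif into a compiled regular expression.

    `elements` is a list (hypothetical encoding, not from the paper):
      - a str item is a solid character,
      - a (lo, hi) tuple is a run of lo..hi don't-care positions.
    The pattern is wrapped in a lookahead so overlapping occurrences
    are all reported.
    """
    parts = []
    for item in elements:
        if isinstance(item, tuple):
            lo, hi = item
            parts.append(".{%d,%d}" % (lo, hi))
        else:
            parts.append(re.escape(item))
    return re.compile("(?=(" + "".join(parts) + "))")


def occurrence_starts(elements, sequence):
    """Return every start position at which the motif occurs in `sequence`."""
    pattern = motif_to_regex(elements)
    return [m.start() for m in pattern.finditer(sequence)]


def rigid_realizations(num_gaps, max_gap):
    """Number of rigid patterns covered by one extensible motif whose
    `num_gaps` gaps may each take any length from 1 to `max_gap`."""
    return max_gap ** num_gaps


if __name__ == "__main__":
    seq = "ACGTACCGTAGGT"
    rigid = ["A", (1, 1), "G", "T"]       # exactly one don't care
    extensible = ["A", (1, 2), "G", "T"]  # one or two don't cares
    print(occurrence_starts(rigid, seq))
    print(occurrence_starts(extensible, seq))
    print(rigid_realizations(3, 4))
```

In this toy sequence the extensible form picks up an occurrence that the rigid form misses, while `rigid_realizations` illustrates the exponential growth in candidates that the abstract attributes to don't cares. Statistical scoring and saturation checks are outside the scope of this sketch.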