Finding motifs for insufficient number of sequences with strong binding to transcription factor

Authors:
Francis Y. L. Chin;Henry C. M. Leung;S. M. Yiu;T. W. Lam;Roni Rosenfeld;W. W. Tsang;David K. Smith;Y. Jiang
Affiliations:
The University of Hong Kong, Hong Kong;The University of Hong Kong, Hong Kong;The University of Hong Kong, Hong Kong;The University of Hong Kong, Hong Kong;Carnegie Mellon University;The University of Hong Kong, Hong Kong;The University of Hong Kong, Hong Kong;The University of Hong Kong, Hong Kong
Venue:
RECOMB '04 Proceedings of the eighth annual international conference on Research in computational molecular biology
Year:
2004

Citing 2

Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization
Machine Learning - Special issue on applications in molecular biology

Combinatorial Approaches to Finding Subtle Signals in DNA Sequences
Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology

Cited 4

DNA Motif Representation with Nucleotide Dependency
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

An efficient motif discovery algorithm with unknown motif length and number of binding sites
International Journal of Data Mining and Bioinformatics

Generalized planted (l,d)-motif problem with negative set
WABI'05 Proceedings of the 5th International conference on Algorithms in Bioinformatics

Graphical approach to weak motif recognition in noisy data sets
PRIB'06 Proceedings of the 2006 international conference on Pattern Recognition in Bioinformatics

Abstract

Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail.
Most existing computational methods for finding motifs are based on the strong-signal model, wherein only strong-signal sequences (i.e. those that are known to contain binding sites very similar to the motif) are considered as input and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded.
Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed, under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required.
We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required to be in the input data set. To calculate the minimum total number of binding sites required, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic.
Next, we introduce a more general and realistic energy-based model, which considers all available sequences (including weak-signal sequences) with varying degrees of binding strength to the transcription factors (as measured experimentally by observed color intensity). Given varying degrees of binding strength, our model can consider sequences ranging from those that contain more than one binding site to those that are weak-signal sequences.
By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm) using an EM-like approach to find motifs under our model. This EBMF algorithm can find motifs for data sets that do not even have the required minimum number of binding sites as previously derived for the strong-signal model. Our algorithm compares favorably with common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.
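
The abstract contrasts a probability-matrix motif representation with a string pattern. The sketch below is only a generic illustration of that representation, not the EBMF algorithm, whose details are not given in this record. It assumes equal-length aligned binding sites, a uniform background of 0.25 per nucleotide, and a pseudocount of 1. The function names (build_probability_matrix, log_odds, best_window) and the example sites are hypothetical.

```python
import math

ALPHABET = "ACGT"


def build_probability_matrix(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    Assumption: all sites have the same length and use only A, C, G, T.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        # Start every count at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in ALPHABET})
    return matrix


def log_odds(matrix, window, background=0.25):
    """Log2 likelihood ratio of a window under the motif vs. the background."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(matrix, window))


def best_window(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    scored = [(log_odds(matrix, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)


# Hypothetical aligned binding sites, used for illustration only.
example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
example_matrix = build_probability_matrix(example_sites)
example_score, example_start = best_window(example_matrix, "GGGCTATAATGCC")
```

Unlike a single consensus string, the matrix keeps the observed variation at each position, so a window that differs from the consensus at a variable position is penalized less than one that differs at a conserved position.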