Simultaneously learning DNA motif along with its position and sequence rank preferences through EM algorithm

Authors:
ZhiZhuo Zhang;Cheng Wei Chang;Willy Hugo;Edwin Cheung;Wing-Kin Sung
Affiliations:
National University of Singapore, Singapore;Genome Institute of Singapore, Singapore;National University of Singapore, Singapore;Genome Institute of Singapore, Singapore;National University of Singapore, Singapore and Genome Institute of Singapore, Singapore
Venue:
RECOMB'12 Proceedings of the 16th Annual international conference on Research in Computational Molecular Biology
Year:
2012

Abstract

Although de novo motifs can be discovered by mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. One way to improve accuracy is to consider additional binding features (i.e., position preference and sequence rank preference), but this information usually has to be supplied by the user. This paper presents a de novo motif discovery algorithm called SEME, which uses a pure probabilistic mixture model to model the motif's binding features and uses the expectation-maximization (EM) algorithm to simultaneously learn the sequence motif and its position and sequence rank preferences, without requiring any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets and 164 ChIP-Seq libraries, we demonstrate the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to the more difficult problem of finding co-regulated TF (co-TF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct co-TF motifs, and its predicted co-TF motifs matched the known motifs more closely. Finally, we show that the learned position and sequence rank preferences of each co-TF reveal potential interaction mechanisms between the primary TF and the co-TF within these sites. Some of these findings were further validated by ChIP-Seq experiments on the co-TFs.
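
To make the idea of learning a motif and its position preference together more concrete, the following is a minimal Python sketch of a toy EM procedure. It is not SEME's implementation: it assumes exactly one motif site per sequence, equal-length sequences, a fixed motif width and a fixed uniform background, and it omits the sequence rank preference, the variable motif length extension and importance sampling described in the abstract. The function name `em_motif_with_position` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random

ALPHABET = "ACGT"


def em_motif_with_position(seqs, width, n_iter=50, seed=0, pseudo=0.1):
    """Toy EM that jointly fits a motif matrix and a position preference.

    Assumptions (not taken from the paper): one site per sequence,
    equal-length sequences, fixed width, uniform background.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    n_pos = len(seqs[0]) - width + 1
    if n_pos < 1:
        raise ValueError("width is longer than the sequences")

    rng = random.Random(seed)
    # pwm[j][b] = P(base b at motif column j); random, strictly positive start.
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(ALPHABET, weights)})
    background = {b: 0.25 for b in ALPHABET}
    pos_pref = [1.0 / n_pos] * n_pos  # position preference starts uniform

    for _ in range(n_iter):
        # E-step: posterior over the site start position in each sequence.
        posteriors = []
        for s in seqs:
            scores = []
            for p in range(n_pos):
                score = pos_pref[p]
                for j in range(width):
                    base = s[p + j]
                    score *= pwm[j][base] / background[base]
                scores.append(score)
            total = sum(scores)
            posteriors.append([x / total for x in scores])

        # M-step: re-estimate both parameter sets from expected counts.
        counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
        pos_counts = [pseudo] * n_pos
        for s, post in zip(seqs, posteriors):
            for p, z in enumerate(post):
                pos_counts[p] += z
                for j in range(width):
                    counts[j][s[p + j]] += z
        pwm = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
        total = sum(pos_counts)
        pos_pref = [c / total for c in pos_counts]

    return pwm, pos_pref
```

Calling `em_motif_with_position(seqs, width=6)` on a list of equal-length DNA strings returns one probability table per motif column and one probability per start position. Because EM only finds a local optimum, the result depends on the random initialization; practical tools typically try several starting points.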