Motif discovery through predictive modeling of gene regulation

Authors:
Manuel Middendorf (Department of Physics, Columbia University, New York, NY)
Anshul Kundaje (Department of Computer Science, Columbia University, New York, NY)
Mihir Shah (Department of Computer Science, Columbia University, New York, NY)
Yoav Freund (Department of Computer Science, Columbia University, New York, NY)
Chris H. Wiggins (Department of Applied Mathematics, Columbia University, New York, NY)
Christina Leslie (Department of Computer Science, Columbia University, New York, NY)
Venue:
RECOMB '05: Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology
Year:
2005

Abstract

We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites, such as position-specific scoring matrices (PSSMs), by incorporating promoter sequence and transcriptome-wide gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with the activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms that of motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature.
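To make the core idea concrete, the following is a minimal sketch of boosting over rules of the form "motif present in the promoter AND regulator in a given state", which predict up- or down-regulation of a gene in an experiment. It is not the authors' implementation: it assumes plain confidence-rated boosting with abstaining rules, uses only fixed-length k-mers (no dimers, no PSSM clustering), and all names (`kmers`, `train`, `score`, the toy genes, and the regulator `TF1`) are hypothetical.

```python
import math
from itertools import product


def kmers(seq, k):
    """Return the set of all k-length substrings of a promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(promoters, reg_states, labels, k=3, rounds=4, eps=1e-3):
    """Boost rules 'k-mer in promoter AND regulator in state s'.

    promoters:  {gene: promoter sequence}
    reg_states: {experiment: {regulator: +1 (up) or -1 (down)}}
    labels:     {(gene, experiment): +1 (up) or -1 (down)}
    Returns a list of (rule, alpha); a rule abstains when it does not fire.
    """
    examples = list(labels)
    hits = {g: kmers(s, k) for g, s in promoters.items()}
    all_kmers = sorted(set().union(*hits.values()))
    regulators = sorted({r for st in reg_states.values() for r in st})
    candidates = list(product(all_kmers, regulators, (+1, -1)))

    def fires(rule, ex):
        motif, reg, state = rule
        gene, exp = ex
        return motif in hits[gene] and reg_states[exp].get(reg) == state

    w = {ex: 1.0 / len(examples) for ex in examples}  # uniform start
    model = []
    for _ in range(rounds):
        best = None
        for rule in candidates:
            w_pos = sum(w[ex] for ex in examples
                        if fires(rule, ex) and labels[ex] > 0)
            w_neg = sum(w[ex] for ex in examples
                        if fires(rule, ex) and labels[ex] < 0)
            # Boosting loss of a rule that abstains where it does not fire.
            z = (1.0 - w_pos - w_neg) + 2.0 * math.sqrt(w_pos * w_neg)
            if best is None or z < best[0]:
                best = (z, rule, w_pos, w_neg)
        _, rule, w_pos, w_neg = best
        # Smoothed vote: positive means 'up', negative means 'down'.
        alpha = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
        model.append((rule, alpha))
        # Reweight only the examples the chosen rule fired on.
        for ex in examples:
            if fires(rule, ex):
                w[ex] *= math.exp(-alpha * labels[ex])
        total = sum(w.values())
        w = {ex: v / total for ex, v in w.items()}
    return model


def score(model, promoter, reg_state, k=3):
    """Sum the votes of all rules that fire; the sign is the prediction."""
    present = kmers(promoter, k)
    return sum(alpha for (motif, reg, state), alpha in model
               if motif in present and reg_state.get(reg) == state)


# Toy data (invented for illustration): genes whose promoter contains
# ACG follow the regulator TF1; the other genes move the opposite way.
promoters = {"g1": "TTACGTT", "g2": "CCACGCC", "g3": "TTTTTTT", "g4": "CCCCCCC"}
reg_states = {"e1": {"TF1": +1}, "e2": {"TF1": -1}}
labels = {("g1", "e1"): +1, ("g2", "e1"): +1, ("g1", "e2"): -1, ("g2", "e2"): -1,
          ("g3", "e1"): -1, ("g4", "e1"): -1, ("g3", "e2"): +1, ("g4", "e2"): +1}
model = train(promoters, reg_states, labels)
```

Because each rule depends only on promoter content and regulator state, `score(model, promoter, reg_state)` can be evaluated for a gene that was not in the training set, which mirrors the abstract's claim that the learned model can predict the expression of any gene given its promoter sequence and the regulators' expression states.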