EURASIP Journal on Bioinformatics and Systems Biology
We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites, including position-specific scoring matrices (PSSMs), by combining promoter sequence with gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with the activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response, rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the promoter sequence of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a sequence of length k (a k-mer), a dimer, or a PSSM built by agglomerative probabilistic clustering of sequences with similar boosting loss. Applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs that predict differential expression of target genes more accurately than motifs from the TRANSFAC dataset and than a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response that are reported in the literature.
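The core idea of pairing motif presence with regulator activity inside a boosting loop can be illustrated with a small sketch. This is not MEDUSA's implementation: it assumes plain discrete AdaBoost over fixed-length k-mers, uses a toy data set invented for illustration, and leaves out dimers and the PSSM clustering step. All names (`kmers`, `rule`, `fit`, `predict`, `TF1`, the gene and experiment labels) are hypothetical.

```python
import math

def kmers(seq, k):
    """Return the set of all length-k substrings of a promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rule(motif, regulator, promoter, active_regs):
    """Weak rule: +1 (up) if the motif is present AND the regulator is active, else -1."""
    return 1 if (motif in promoter and regulator in active_regs) else -1

def fit(promoters, active, labels, k=3, rounds=3, eps=1e-6):
    """Toy discrete AdaBoost over (motif, regulator) rules.

    promoters: gene -> promoter sequence
    active:    experiment -> set of active regulators
    labels:    (gene, experiment) -> +1 (up) or -1 (down)
    """
    examples = list(labels)
    w = {ex: 1.0 / len(examples) for ex in examples}  # uniform start weights
    motifs = sorted(set().union(*(kmers(s, k) for s in promoters.values())))
    regs = sorted(set().union(*active.values()))
    model = []
    for _ in range(rounds):
        best = None
        for m in motifs:
            for r in regs:
                # Weighted error of this candidate rule on the current weights.
                err = sum(w[(g, e)] for (g, e) in examples
                          if rule(m, r, promoters[g], active[e]) != labels[(g, e)])
                # Keep the rule farthest from chance (error 0.5).
                if best is None or abs(0.5 - err) > abs(0.5 - best[0]):
                    best = (err, m, r)
        err, m, r = best
        err = min(max(err, eps), 1 - eps)        # avoid log(0) for perfect rules
        alpha = 0.5 * math.log((1 - err) / err)  # rule weight
        model.append((alpha, m, r))
        # Up-weight misclassified examples, then renormalize.
        for (g, e) in examples:
            h = rule(m, r, promoters[g], active[e])
            w[(g, e)] *= math.exp(-alpha * labels[(g, e)] * h)
        z = sum(w.values())
        w = {ex: v / z for ex, v in w.items()}
    return model

def predict(model, promoter, active_regs):
    """Sign of the weighted vote of all learned rules."""
    score = sum(a * rule(m, r, promoter, active_regs) for a, m, r in model)
    return 1 if score >= 0 else -1

# Toy data (invented): genes go up only when "ACG" is present and TF1 is active.
promoters = {"g1": "ACGTGACC", "g2": "TTTTAAAA", "g3": "GGACGTTT"}
active = {"e1": {"TF1"}, "e2": set()}
labels = {("g1", "e1"): 1, ("g1", "e2"): -1, ("g2", "e1"): -1,
          ("g2", "e2"): -1, ("g3", "e1"): 1, ("g3", "e2"): -1}
model = fit(promoters, active, labels)
```

Because each rule fires only when a motif and a regulator co-occur, the learned model can score a gene it has never seen, given its promoter and the regulator states, for example `predict(model, "CCACGCC", {"TF1"})`. A rule that only counted motif frequency would not distinguish the two experiments.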