Machine learning methods for transcription data integration

Authors:
D. T. Holloway;M. A. Kon;C. DeLisi
Venue:
IBM Journal of Research and Development - Systems biology
Year:
2006

Abstract

Gene expression is modulated by transcription factors (TFs), proteins that generally bind to DNA adjacent to coding regions and initiate transcription. Each target gene can be regulated by more than one TF, and each TF can regulate many targets. A complete molecular understanding of transcriptional regulation therefore requires first associating each TF with the set of genes that it regulates. Here we summarize completed work on associating 104 TFs with their binding sites using support vector machines (SVMs), classification algorithms grounded in statistical learning theory. We train classifiers on several types of genomic data to predict TF binding in the yeast genome: motif matches, subsequence counts, motif conservation, functional annotation, and expression profiles. A simple weighting scheme varies the contribution of each data type when building a final SVM classifier, which we evaluate against known binding sites published in the literature and in online databases. The SVM algorithm works best when all datasets are combined, producing 73% coverage of known interactions with a prediction accuracy of almost 0.9. We also discuss new ideas and preliminary work for improving SVM classification of biological data.
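The abstract does not spell out how the weighting scheme is implemented. One common way to vary the contribution of each data type in an SVM is to build one kernel (similarity) matrix per data type and take a weighted sum. The sketch below illustrates that idea only; the function names, the toy feature values, the weights, and the choice of normalized linear kernels are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List

Matrix = List[List[float]]


def linear_kernel(features: List[List[float]]) -> Matrix:
    """Gram matrix of dot products between every pair of genes."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features]
            for x in features]


def normalize_kernel(k: Matrix) -> Matrix:
    """Rescale to K_ij / sqrt(K_ii * K_jj) so data types are comparable."""
    n = len(k)
    diag = [k[i][i] ** 0.5 for i in range(n)]
    return [[k[i][j] / (diag[i] * diag[j]) if diag[i] and diag[j] else 0.0
             for j in range(n)]
            for i in range(n)]


def combine_kernels(kernels: Dict[str, Matrix],
                    weights: Dict[str, float]) -> Matrix:
    """Weighted sum of per-data-type kernel matrices (assumed scheme)."""
    names = list(kernels)
    n = len(kernels[names[0]])
    return [[sum(weights[name] * kernels[name][i][j] for name in names)
             for j in range(n)]
            for i in range(n)]


# Toy example: three genes described by two hypothetical data types.
datasets = {
    "motif_matches": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "expression": [[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]],
}
# Illustrative weights; in practice these would be tuned on held-out data.
weights = {"motif_matches": 0.75, "expression": 0.25}

kernels = {name: normalize_kernel(linear_kernel(x))
           for name, x in datasets.items()}
combined = combine_kernels(kernels, weights)
```

A combined matrix of this kind could then be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. A weighted sum of valid kernels with non-negative weights is itself a valid kernel, which is why this form of data integration fits SVMs naturally.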