Boosting binding sites prediction using gene's positions

Authors:
Mohamed Elati;Rim Fekih;Rémy Nicolle;Ivan Junier;Joan Hérisson;François Képès
Affiliations:
ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex
Venue:
WABI'11 Proceedings of the 11th international conference on Algorithms in bioinformatics
Year:
2011

Abstract

Understanding transcriptional regulation requires reliable identification of the DNA binding sites recognized by each transcription factor (TF). Building an accurate bioinformatic model of TF-DNA binding is an essential step in differentiating true binding targets from spurious ones. Conventional approaches to binding site prediction are based on the notion of consensus sequences. They are formalized as position-specific weight matrices (PWMs) and rely on statistical analysis of the DNA sequences of known binding sites. To improve on these techniques, we propose to use knowledge of genome organization, namely the optimal positioning of co-regulated genes along the whole chromosome. For this purpose, we use machine learning approaches to optimally combine sequence information with positioning information. We present a new learning algorithm called PreCisIon, which relies on a TF binding classifier that optimally combines a set of PWMs and chromosomal position-based classifiers. This non-linear binding decision rule drastically reduces the rate of false positives, so that PreCisIon consistently outperforms sequence-based methods. This is shown by a cross-validation analysis in two model organisms: Escherichia coli and Bacillus subtilis. The analysis is based on the identification of binding sites for 24 TFs; PreCisIon achieved on average an AUC (area under the curve) of 70% and 60%, a sensitivity of 80% and 70%, and a specificity of 60% and 56% for B. subtilis and E. coli, respectively.
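
To make the sequence-based baseline concrete, the following minimal Python sketch shows how a PWM might be built from known binding sites and used to score a candidate sequence, and how the votes of several weak classifiers could be merged by a boosting-style weighted vote. It is an illustration under stated assumptions and not the PreCisIon implementation: the function names, the uniform background frequency, the pseudocount, the example sites, and the vote weights are all hypothetical, and the position-based classifier is represented only by a placeholder vote.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length binding sites.

    Assumes a uniform background frequency and an additive pseudocount;
    both are illustrative choices, not taken from the paper.
    """
    length = len(sites[0])
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Log2 ratio of the observed base frequency to the background.
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def pwm_score(pwm, seq):
    """Sum the per-position log-odds of a sequence as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_window_score(pwm, region):
    """Slide the PWM along a longer region and return the best window score."""
    width = len(pwm)
    return max(
        pwm_score(pwm, region[i:i + width])
        for i in range(len(region) - width + 1)
    )


def weighted_vote(votes, alphas):
    """Boosting-style decision: True if the weighted sum of +1/-1 votes is positive.

    The votes stand in for weak classifiers (e.g. a thresholded PWM score
    and a hypothetical position-based rule); the weights are assumed here,
    whereas a boosting algorithm would learn them from training data.
    """
    return sum(a * v for a, v in zip(alphas, votes)) > 0


# Hypothetical example sites, for illustration only.
sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]
pwm = build_pwm(sites)
print(pwm_score(pwm, "TTGACA") > pwm_score(pwm, "CCCGGG"))
print(weighted_vote([+1, -1], [0.7, 0.3]))
```

In this sketch a sequence classifier that votes for binding can still be overruled when the other votes carry more weight, which is the general mechanism by which a combined decision rule can reject false positives that a PWM alone would accept.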