Boosting binding sites prediction using gene's positions

  • Authors:
  • Mohamed Elati;Rim Fekih;Rémy Nicolle;Ivan Junier;Joan Hérisson;François Képès

  • Affiliations:
  • ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex;ISSB, Genopole, CNRS, UPS, Université Évry Val d'Essonne, Évry Cedex

  • Venue:
  • WABI'11 Proceedings of the 11th international conference on Algorithms in bioinformatics
  • Year:
  • 2011

Abstract

Understanding transcriptional regulation requires reliable identification of the DNA binding sites recognized by each transcription factor (TF). Building an accurate bioinformatic model of TF-DNA binding is an essential step in distinguishing true binding targets from spurious ones. Conventional approaches to binding site prediction are based on the notion of consensus sequences. They are formalized as position-specific weight matrices (PWMs) and rely on statistical analysis of the DNA sequences of known binding sites. To improve these techniques, we propose to use knowledge of genome organization, namely the optimal positioning of co-regulated genes along the whole chromosome. For this purpose, we use machine learning approaches to optimally combine sequence information with positioning information. We present a new learning algorithm called PreCisIon, which relies on a TF binding classifier that optimally combines a set of PWMs and chromosomal position-based classifiers. This non-linear binding decision rule drastically reduces the rate of false positives, so that PreCisIon consistently outperforms sequence-based methods. This is shown by a cross-validation analysis in two model organisms: Escherichia coli and Bacillus subtilis. The analysis is based on the identification of binding sites for 24 TFs; PreCisIon achieved on average an AUC (area under the curve) of 70% and 60%, a sensitivity of 80% and 70%, and a specificity of 60% and 56% for B. subtilis and E. coli, respectively.
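As an illustration of the conventional sequence-based baseline mentioned in the abstract, the sketch below builds a log-odds PWM from a handful of aligned sites and scans a sequence for its best-scoring window. It is a minimal, generic PWM example and not the PreCisIon algorithm; the function names, the example sites, the pseudocount, and the uniform background frequency are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length binding sites.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    pwm = []
    for i in range(width):
        # Count each base in column i, starting from the pseudocount.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against the background.
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)


# Hypothetical aligned sites, invented for this example only.
KNOWN_SITES = ["TTGACA", "TTGACT", "TTGATA", "TTTACA"]

pwm = build_pwm(KNOWN_SITES)
best_score, best_pos = best_hit(pwm, "GGCGTTGACAGGCC")
```

A purely sequence-based scan like this flags every window that resembles the consensus, which is the source of the false positives the abstract refers to; the paper's contribution is to combine such scores with chromosomal position information.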