Using pre & post-processing methods to improve binding site predictions

  • Authors:
  • Yi Sun; Cristina González Castellano; Mark Robinson; Rod Adams; Alistair G. Rust; Neil Davey

  • Affiliations:
  • Science and Technology Research Institute, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK; Ignos Estudio de Ingeniería S.L., Calle San Juan, 10 La Laguna, Santa Cruz de Tenerife, Canary Islands, C.P. 38203, Spain; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA; Science and Technology Research Institute, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK; Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA; Science and Technology Research Institute, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK

  • Venue:
  • Pattern Recognition
  • Year:
  • 2009

Quantified Score

Hi-index 0.01

Abstract

Even the best current algorithms for predicting transcription factor binding sites in sequences of regulatory DNA are severely limited in accuracy. In this paper, we integrate 12 original binding site prediction algorithms and use a 'window' of consecutive predictions so that each result is interpreted in the context of its neighbours. We combine SMOTE over-sampling with under-sampling by either random selection or Tomek links. In addition, we investigate the behaviour of four feature selection filtering methods: bi-normal separation, correlation coefficients, F-Score and a cross-entropy-based algorithm. Finally, we remove some of the final predicted binding sites on the basis of their biological plausibility. The results show that we can generate a new prediction that significantly improves on the performance of any one of the individual algorithms.
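
As an illustration only, the windowing step and one of the filter scores mentioned in the abstract might be sketched as follows. This is not the authors' implementation: the function names, the zero-padding at sequence ends, and the window width are all assumptions, and the F-Score shown is one commonly used definition (between-class separation of the means divided by the summed within-class sample variances), which may differ in detail from the one used in the paper.

```python
from statistics import mean, variance


def window_features(preds, width):
    """Concatenate each position's predictions with those of its neighbours.

    preds: list of rows, one per sequence position; each row holds the
           outputs of the base predictors (12 in the paper) at that position.
    width: odd window size (hypothetical parameter; not given in the abstract).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    n_alg = len(preds[0])
    # Assumption: pad both ends with zeros so edge positions get a full window.
    pad = [[0.0] * n_alg for _ in range(half)]
    padded = pad + [[float(v) for v in row] for row in preds] + pad
    return [
        [v for row in padded[i:i + width] for v in row]
        for i in range(len(preds))
    ]


def f_score(column, labels):
    """F-Score of one feature for binary labels (1 = binding site, 0 = not).

    Needs at least two examples per class, since sample variances are used.
    """
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    overall = mean(column)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


# Three positions, two hypothetical base predictors, window of width 3.
windowed = window_features([[1, 0], [0, 1], [1, 1]], 3)
```

Each windowed row would then serve as the input vector for the combining classifier, and features could be ranked by `f_score` before training. Sampling (SMOTE, Tomek links) and the biological plausibility filter are not shown here.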