Predicting SUMOylation sites in developmental transcription factors of Drosophila melanogaster

  • Authors:
  • Denis C. Bauer;Fabian A. Buske;Timothy L. Bailey;Mikael Bodén

  • Affiliations:
  • Institute for Molecular Bioscience, The University of Queensland, Australia;Institute for Molecular Bioscience, The University of Queensland, Australia;Institute for Molecular Bioscience, The University of Queensland, Australia;Institute for Molecular Bioscience, The University of Queensland, Australia

  • Venue:
  • Neurocomputing
  • Year:
  • 2010

Abstract

Recent evidence suggests that SUMOylation of proteins plays a key role in the assembly and disassembly of nuclear sub-compartments, as well as in gene regulation, by reversing the functional role of transcription factors. Determining whether a protein contains a SUMOylation site thus provides essential clues about its intra-nuclear spatial association and function. We investigate whether SUMOylation site prediction accuracy can be improved by machine learning methods that integrate non-local and (predicted) structural properties, including secondary structure, solvent accessibility and evolutionary profiles. Using a range of properties available from a target protein's amino acid sequence together with a support vector machine, we demonstrate that local sequence features enable the best generalization, with structural features having little to no impact. The support vector machine model for SUMOylation sites based on the primary protein sequence achieves an area under the ROC curve of 0.92 using fivefold cross-validation, and 96% accuracy on an independent hold-out test set, which is superior to previously published methods. However, scanning sequence data with a simple consensus motif performs equally well, with reduced computational time and no bias towards the chosen training data. We show that the simple consensus motif makes biologically reasonable predictions and use it to identify specific sites that may explain the dual role ascribed to a set of transcription factors in Drosophila melanogaster.
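
To make the consensus-motif scan concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the motif, so the pattern here is an assumption: the widely cited SUMOylation consensus ψ-K-x-[E/D], with ψ taken to be one of I, L, V, M or F. The function name `scan_sumo_sites` and the example sequence are hypothetical.

```python
import re

# Assumption: the commonly cited consensus psi-K-x-[E/D], where psi is a
# large hydrophobic residue (here I, L, V, M or F) and x is any residue.
# The paper may define its motif differently.
# The lookahead wrapper lets overlapping candidate sites be reported.
SUMO_MOTIF = re.compile(r"(?=([ILVMF]K[A-Z][ED]))")


def scan_sumo_sites(sequence):
    """Return (1-based position of the lysine, matched 4-mer) per motif hit."""
    seq = sequence.upper()
    # m.start() is the 0-based index of psi; the lysine follows it directly,
    # so its 1-based position is m.start() + 2.
    return [(m.start() + 2, m.group(1)) for m in SUMO_MOTIF.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any real protein.
    for position, site in scan_sumo_sites("MALKQEAAGGVKIDP"):
        print(position, site)
```

A scan of this kind needs no training data, which is consistent with the abstract's point that the motif approach has no bias towards a chosen training set and runs quickly.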