Random forest-based prediction of protein sumoylation sites from sequence features

  • Authors:
  • Shaolei Teng; Hong Luo; Liangjiang Wang

  • Affiliations:
  • Clemson University, Clemson, SC; Clemson University, Clemson, SC; Clemson University, Clemson, SC and Greenwood Genetic Center, Greenwood, SC

  • Venue:
  • Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology
  • Year:
  • 2010

Abstract

Protein sumoylation plays essential roles in the eukaryotic cell, and alterations in this process may cause various human diseases. This paper describes a new machine learning approach for predicting sumoylation sites from protein sequence information. Random Forests (RFs), which can handle a large number of input variables and avoid model overfitting, were trained with data collected from the literature. To construct accurate classifiers, forty sequence features were selected for input vector encoding. The results suggested that RF classifier performance was affected by the sequence context of sumoylation sites, and that using eighteen residues with the core motif ψKXE in the middle gave the highest performance (ROC AUC = 0.9328). The RF classifiers were also found to outperform support vector machine (SVM) models on the same dataset. Thus, the RF algorithm appears to be the best choice for accurate prediction of protein sumoylation sites from sequence features.
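The window-based encoding mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline, and it rests on several assumptions that the abstract does not state: the eighteen-residue window is taken to be seven flanking residues on each side of the four-residue core ψKXE; ψ is taken to be one of I, L, V, M, F; windows running past a sequence end are padded with "-"; and a plain one-hot encoding stands in for the forty sequence features, which are not listed here. All function and constant names are hypothetical.

```python
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PSI = set("ILVMF")   # assumption: residues accepted at the psi position
FLANK = 7            # assumption: 7 + 4 (psi-K-X-E) + 7 = 18 residues
PAD = "-"            # assumption: padding symbol past the sequence ends

LEFT = FLANK + 1     # psi plus 7 flanking residues precede the lysine
RIGHT = FLANK + 2    # X, E plus 7 flanking residues follow the lysine


def lysine_windows(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based lysine index, 18-residue window) for every K in seq."""
    seq = seq.upper()
    padded = PAD * LEFT + seq + PAD * RIGHT
    windows = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # In padded coordinates the lysine sits at i + LEFT,
            # so the window starts at i and spans LEFT + 1 + RIGHT residues.
            windows.append((i, padded[i:i + LEFT + 1 + RIGHT]))
    return windows


def matches_consensus(window: str) -> bool:
    """Check whether the window centre follows the psi-K-X-E pattern."""
    k = LEFT  # index of the lysine inside the window
    return window[k - 1] in PSI and window[k] == "K" and window[k + 2] == "E"


def one_hot(window: str) -> List[int]:
    """Encode a window as 20 binary indicators per position (padding is all zeros)."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == ref else 0 for ref in AMINO_ACIDS)
    return vector
```

Each candidate lysine yields one fixed-length vector, which, together with a label for known sumoylation status, could then be passed to any random forest implementation and scored by ROC AUC.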