Predicting the protein solubility by integrating chaos games representation and entropy in information theory

  • Authors:
  • Niu Xiaohui;Shi Feng;Hu Xuehai;Xia Jingbo;Li Nana

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2014

Quantified Score

Hi-index 12.05

Abstract

Protein solubility is a prerequisite for many structural and functional studies. Predicting the propensity of a protein to be soluble or to form inclusion bodies is a challenging and crucial problem. To formulate protein samples that reflect the intrinsic correlation with protein solubility, triangle, quadrangle and 12-vertex polygon chaos game representation (CGR) and the concept of entropy in information theory, together with amino acid and dipeptide compositions, are applied based on a different mode of pseudo amino acid composition (PseAAC). The mathematical expressions involving seven CGR methods and the amino acid and dipeptide compositions, with their corresponding entropies, are evaluated with 10-fold cross-validation and a re-substitution test. The numerical results confirm that introducing entropy can significantly improve the performance of the classifiers. The triangle CGR method surpasses the other two CGR methods in classifier construction, and it can provide sequence-order information complementary to dipeptide composition. The optimal mathematical expression combines dipeptide composition, triangle CGR and their entropies. With the 2-level triangle polygon CGR and dipeptide composition, together with their corresponding entropies, as the mathematical feature, the classifier achieved its best accuracy of 88.45% and an MCC of 0.7588 in the 10-fold cross-validation test. In the re-substitution test, the 3-level triangle polygon CGR, dipeptide composition and their entropies performed best, with an accuracy of 92.38% and an MCC of 0.8387.
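
The feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their formulas, so the function names, the three-group residue assignment used for the triangle CGR, the unit-circle vertex placement and the halfway contraction ratio are all assumptions chosen for illustration. The sketch computes amino acid and dipeptide compositions, the Shannon entropy of a frequency vector, the point trajectory of a polygon CGR, and the Matthews correlation coefficient (MCC) used as an evaluation metric.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq, k=1):
    """Frequencies of all k-mers (k=1: amino acid, k=2: dipeptide composition)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignores k-mers with non-standard residues
    return [counts[m] / total if total else 0.0 for m in kmers]

def shannon_entropy(freqs):
    """Shannon entropy in bits of a frequency vector (zero entries contribute 0)."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

# HYPOTHETICAL grouping for a triangle CGR (not taken from the paper):
# 0 = hydrophobic, 1 = polar, 2 = charged.
TRIANGLE_GROUPS = {r: 0 for r in "AVLIMFWC"}
TRIANGLE_GROUPS.update({r: 1 for r in "GSTYNQP"})
TRIANGLE_GROUPS.update({r: 2 for r in "DEKRH"})

def polygon_cgr(seq, group_of, n_vertices):
    """Chaos game on a regular polygon: from the current point, move halfway
    toward the vertex assigned to each residue. Assumes vertices on the unit
    circle and a start at the centre; residues without a group are skipped."""
    vertices = [
        (math.cos(2 * math.pi * i / n_vertices), math.sin(2 * math.pi * i / n_vertices))
        for i in range(n_vertices)
    ]
    x, y = 0.0, 0.0
    points = []
    for residue in seq:
        if residue not in group_of:
            continue
        vx, vy = vertices[group_of[residue]]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence, for illustration only
    dipep = composition(seq, k=2)
    features = dipep + [shannon_entropy(dipep)]  # composition plus its entropy
    print(len(features), polygon_cgr(seq, TRIANGLE_GROUPS, 3)[:2])
```

Under these assumptions, a feature vector would concatenate a composition with its entropy, as in the `features` line; how the paper turns CGR points into 2-level or 3-level features is not stated in the abstract, so the sketch stops at the point trajectory.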