Reducing overfitting in predicting intrinsically unstructured proteins

  • Authors:
  • Pengfei Han;Xiuzhen Zhang;Raymond S. Norton;Zhiping Feng

  • Affiliations:
  • School of Computer Science and IT, RMIT University, Melbourne, VIC, Australia;School of Computer Science and IT, RMIT University, Melbourne, VIC, Australia;The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia;The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia

  • Venue:
  • PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year:
  • 2007

Abstract

Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and can be promising candidates for disordered region prediction.
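
To make the feature reduction concrete, the following is a minimal sketch of how a 20-dimensional amino acid composition might be collapsed into 4 group-level features for a sequence window. The abstract does not state which 4 groups the authors used, so the grouping below (hydrophobic, polar, positive, negative) is purely an illustrative assumption, as are the names `GROUPS`, `composition`, and `grouped_composition`.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a hypothetical property-based grouping chosen for illustration.
# It is NOT taken from the paper, whose actual 4 groups are not given here.
GROUPS = {
    "hydrophobic": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}


def composition(window):
    """Return the 20 per-residue frequencies for a sequence window."""
    if not window:
        raise ValueError("window must be non-empty")
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in AMINO_ACIDS]


def grouped_composition(window):
    """Return 4 group-level frequencies by summing counts within each group."""
    if not window:
        raise ValueError("window must be non-empty")
    counts = Counter(window)
    return {
        name: sum(counts[aa] for aa in members) / len(window)
        for name, members in GROUPS.items()
    }
```

Either feature vector could then be passed to a decision tree or random forest learner. Non-standard residue codes (for example `X`) are ignored by both functions, so the frequencies sum to less than 1 when such codes are present.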