Training neural networks for protein secondary structure prediction: the effects of imbalanced data set

  • Authors:
  • Viviane Palodeto;Hernán Terenzi;Jefferson Luiz Brum Marques

  • Affiliations:
  • Biomedical Engineering Institute, Federal University of Santa Catarina, Florianópolis, SC, Brazil;Biochemistry Department, Federal University of Santa Catarina, Florianópolis, SC, Brazil;Biomedical Engineering Institute, Federal University of Santa Catarina, Florianópolis, SC, Brazil

  • Venue:
  • ICIC'09 Proceedings of the Intelligent computing 5th international conference on Emerging intelligent computing technology and applications
  • Year:
  • 2009

Abstract

Protein secondary structure prediction (PSSP) is one of the main tasks in computational biology. Over the last few decades, much effort has gone into solving this problem with various approaches, mainly artificial neural networks (ANNs). To predict protein secondary structure, the ANN is generally trained on the CB513 data set. Like protein structure databases, this data set is imbalanced, which can yield a low error rate for the majority class and an undesirably high error rate for the minority class. In this paper we evaluate the effects of an imbalanced data set on the training and learning of neural networks applied to protein secondary structure prediction. To this end, we applied resampling methods to tackle the class imbalance problem. Results show that imbalanced data sets decrease the prediction rates for helices. However, the class distribution of the protein data set does not significantly affect the global accuracy (Q3).
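The contrast between per-class rates and the global Q3 score can be illustrated with a small sketch. The abstract does not say which resampling methods were used, so the random oversampling below is only one common option, assumed for illustration. The function names (`random_oversample`, `q3`, `per_class_accuracy`) and the toy sequences are hypothetical and not taken from the paper. Q3 is computed here as the fraction of residues whose three-state label (H for helix, E for strand, C for coil) is predicted correctly.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.
    (Assumed resampling scheme; the paper's exact method is not stated.)"""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def q3(true_states, pred_states):
    """Global three-state accuracy: correctly predicted residues / all residues."""
    correct = sum(t == p for t, p in zip(true_states, pred_states))
    return correct / len(true_states)


def per_class_accuracy(true_states, pred_states):
    """Fraction of residues of each true state that were predicted correctly."""
    total = Counter(true_states)
    hit = Counter(t for t, p in zip(true_states, pred_states) if t == p)
    return {state: hit[state] / total[state] for state in total}


# Toy example (made-up labels): coil dominates, helix is never predicted.
true = "CCCCCCHHEE"
pred = "CCCCCCCCEE"
print(q3(true, pred))                  # 0.8
print(per_class_accuracy(true, pred))  # {'C': 1.0, 'H': 0.0, 'E': 1.0}

# Balance the toy training set before (hypothetically) training a network.
features = list(range(len(true)))      # placeholder feature vectors
bal_x, bal_y = random_oversample(features, list(true))
print(Counter(bal_y))                  # each class now has 6 samples
```

In this toy case Q3 stays at 0.8 even though no helix residue is predicted correctly, which is why a per-class breakdown is worth reporting alongside the global score.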