Training neural networks for protein secondary structure prediction: the effects of imbalanced data set

Authors:
Viviane Palodeto;Hernán Terenzi;Jefferson Luiz Brum Marques
Affiliations:
Biomedical Engineering Institute, Federal University of Santa Catarina, Florianópolis, SC, Brazil;Biochemistry Department, Federal University of Santa Catarina, Florianópolis, SC, Brazil;Biomedical Engineering Institute, Federal University of Santa Catarina, Florianópolis, SC, Brazil
Venue:
ICIC'09 Proceedings of the Intelligent computing 5th international conference on Emerging intelligent computing technology and applications
Year:
2009



Abstract

Protein secondary structure prediction (PSSP) is one of the main tasks in computational biology. Over the last few decades, much effort has gone into solving this problem with various approaches, mainly artificial neural networks (ANNs). To predict protein secondary structure, ANNs are generally trained on the CB513 data set. Like protein structure databases, this data set is imbalanced, which can lead to a low error rate for the majority class and an undesirably high error rate for the minority class. In this paper we evaluate the effects of an imbalanced data set on the training and learning of neural networks applied to protein secondary structure prediction. To this end, we applied resampling methods to tackle the class imbalance problem. Results show that imbalanced data sets decrease helix prediction rates. However, the class distribution of the protein data set does not significantly affect the global accuracy (Q3).
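The abstract does not say which resampling methods were used, so the following is only a minimal sketch of one common option, random oversampling, together with the two kinds of score the abstract contrasts: the global accuracy (Q3) and a per-class prediction rate. The function names (`random_oversample`, `q3`, `per_class_accuracy`) and the toy labels are assumptions made for illustration. The labels H, E and C stand for the usual three states (helix, strand, coil), and the toy counts are invented and do not reflect the CB513 class distribution.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class (illustrative only)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


def q3(true, pred):
    """Global accuracy: fraction of all residues predicted correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def per_class_accuracy(true, pred):
    """Fraction of residues of each true class predicted correctly."""
    total = Counter(true)
    correct = Counter(t for t, p in zip(true, pred) if t == p)
    return {c: correct[c] / total[c] for c in total}


# Toy example (invented labels): one helix residue is misclassified as coil.
true = list("CCCCCCHHEE")
pred = list("CCCCCCCHEE")
print(q3(true, pred))                  # 0.9
print(per_class_accuracy(true, pred))  # {'C': 1.0, 'H': 0.5, 'E': 1.0}

# Balance the toy training labels before (hypothetical) network training.
_, balanced = random_oversample(list(range(len(true))), true)
print(Counter(balanced))               # every class now has 6 samples
```

In this toy case the global score stays at 0.9 while the helix rate is only 0.5, which illustrates how a global accuracy figure can hide a weak per-class rate.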