SPSO: synthetic protein sequence oversampling for imbalanced protein data and remote homology detection

  • Authors:
  • Majid Beigi; Andreas Zell

  • Affiliations:
  • Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Tübingen, Germany (both authors)

  • Venue:
  • ISBMDA'06 Proceedings of the 7th international conference on Biological and Medical Data Analysis
  • Year:
  • 2006

Abstract

Many classifiers are designed under the assumption that the training data are well balanced. In real problems such as protein classification and remote homology detection, however, binary classifiers such as support vector machines (SVMs) and other kernel methods face imbalanced data: there are few protein sequences in the positive (minor) class compared with the negative (major) class. A widely used remedy in protein classification is to assign different error costs or decision thresholds to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that the efficiency and stability of this approach decrease when the datasets are highly imbalanced, especially when the classes overlap. This paper shows that combining that approach with our proposed oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. Our oversampling method creates synthetic protein sequences of the minor class, taking into account the distributions of both the minor and the major class, and it operates in data space rather than feature space. The method is very useful in remote homology detection. We used real and artificial data with different distributions and degrees of overlap between the minor and major classes to measure its efficiency, evaluating performance by the area under the receiver operating characteristic (ROC) curve.
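As a small illustration of the evaluation measure named in the abstract, the area under the ROC curve can be computed directly from classifier scores as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below is not taken from the paper: the function name `roc_auc`, the 0/1 label encoding, and the toy scores are assumptions made purely for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 values (1 = positive / minor class); this
            encoding is an assumption of the sketch.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 positives against 6 negatives (made-up scores).
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05, 0.01]
auc = roc_auc(labels, scores)
```

Because this measure depends only on how positives are ranked relative to negatives, it does not change when the decision threshold is moved, which makes it a common choice for comparing classifiers on imbalanced data. The pairwise loop is quadratic in the number of examples, so a rank-based formulation would be preferable for large datasets.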