SVM Learning from Imbalanced Data by GA Sampling for Protein Domain Prediction

  • Authors:
  • Shuxue Zou;Yanxin Huang;Yan Wang;Jianxin Wang;Chunguang Zhou

  • Venue:
  • ICYCS '08 Proceedings of the 2008 The 9th International Conference for Young Computer Scientists
  • Year:
  • 2008

Abstract

Support Vector Machines (SVMs) have been extensively studied and have shown remarkable success in many applications, but their performance drops significantly on imbalanced datasets. Some researchers have pointed out that this decrease is difficult to avoid when one tries to improve the efficiency of SVMs on imbalanced datasets solely by modifying the algorithm itself. Sampling, applied as a data pretreatment step, is therefore a popular strategy for handling the class imbalance problem, since it rebalances the dataset directly. In this paper, we propose a novel sampling method based on Genetic Algorithms (GA) to rebalance the imbalanced training dataset for SVM. To evaluate the final classifiers more impartially, AUC (Area Under the ROC Curve) is employed as the fitness function in the GA. The experimental results show that the GA-based sampling strategy outperforms random sampling, and that our method is superior to a standalone SVM for imbalanced protein domain boundary prediction. The prediction accuracy is about 70%, with an AUC of 0.905.
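The abstract does not give the chromosome encoding, the genetic operators, or the SVM settings, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a bit-mask chromosome that undersamples the majority class, tournament selection, uniform crossover, bit-flip mutation, and one-elite survival. To stay dependency-free, a nearest-centroid scorer (`fit_centroid_scorer`, a hypothetical stand-in) takes the place of SVM training. The only element taken from the abstract is that AUC on held-out data serves as the GA fitness.

```python
import random

def auc(scores, labels):
    """AUC via the rank-sum identity: P(score of a positive > score of a negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_scorer(xs, ys):
    """Stand-in for SVM training (assumption): higher score means closer to the positive centroid."""
    def centroid(label):
        rows = [x for x, y in zip(xs, ys) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_pos, c_neg = centroid(1), centroid(0)
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return lambda x: dist2(x, c_neg) - dist2(x, c_pos)

def fitness(mask, minority, majority, val_x, val_y, fit=fit_centroid_scorer):
    """Train on all minority samples plus the masked majority subset; return validation AUC."""
    chosen = [x for x, keep in zip(majority, mask) if keep]
    if not chosen:  # an empty majority subset cannot train a two-class model
        return 0.0
    xs = minority + chosen
    ys = [1] * len(minority) + [0] * len(chosen)
    score = fit(xs, ys)
    return auc([score(x) for x in val_x], val_y)

def ga_sample(minority, majority, val_x, val_y,
              pop_size=20, generations=15, p_mut=0.05, seed=0):
    """Search for a majority-class subset (bit mask) that maximizes validation AUC."""
    rng = random.Random(seed)
    n = len(majority)
    # Start near a balanced sample: keep each majority example with prob ~ |minority| / |majority|.
    keep_prob = min(1.0, len(minority) / n)
    pop = [[rng.random() < keep_prob for _ in range(n)] for _ in range(pop_size)]

    def evaluate(mask):
        return fitness(mask, minority, majority, val_x, val_y)

    for _ in range(generations):
        fits = [evaluate(m) for m in pop]
        elite = pop[max(range(pop_size), key=fits.__getitem__)]

        def tournament():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if fits[a] >= fits[b] else pop[b]

        children = [elite[:]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            child = [rng.choice(pair) for pair in zip(p1, p2)]               # uniform crossover
            child = [(not g) if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = children

    fits = [evaluate(m) for m in pop]
    best = max(range(pop_size), key=fits.__getitem__)
    return pop[best], fits[best]
```

To use an actual SVM, one would pass a different `fit` callable to `fitness`: any function that trains on `(xs, ys)` and returns a per-sample scoring function, such as a wrapper around an SVM's decision values. The validation set should be kept separate from the final test data, since the GA optimizes AUC on it directly.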