Sample subset optimization for classifying imbalanced biological data

  • Authors:
  • Pengyi Yang; Zili Zhang; Bing B. Zhou; Albert Y. Zomaya

  • Affiliations:
  • School of Information Technologies, University of Sydney, NSW, Australia and NICTA, Australian Technology Park, Eveleigh, NSW, Australia and Centre for Distributed and High Performance Computing, ...; Faculty of Computer and Information Science, Southwest University, China and School of Information Technology, Deakin University, VIC, Australia; School of Information Technologies, University of Sydney, NSW, Australia and Centre for Distributed and High Performance Computing, University of Sydney, NSW, Australia; School of Information Technologies, University of Sydney, NSW, Australia and Centre for Distributed and High Performance Computing, University of Sydney, NSW, Australia

  • Venue:
  • PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
  • Year:
  • 2011

Abstract

Data in many biological problems are often complicated by an imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms, such as the support vector machine (SVM), are sensitive to data with an imbalanced class distribution and produce suboptimal classification as a result. It is desirable to compensate for the imbalance effect in model training to achieve more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than the sampling used in widely used ensemble approaches such as bagging and boosting.
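
To make the general idea of an ensemble built on roughly balanced subsets concrete, the following is a minimal sketch and not the method of the paper. It assumes binary labels coded as 1 (positive) and 0 (negative), draws each subset by plain random under-sampling of the majority class (the paper instead optimizes which samples enter each subset, and that optimization step is not reproduced here), and uses a nearest-centroid scorer as a stand-in for an SVM base classifier so that the example needs only the Python standard library. All function names (`balanced_subsets`, `fit_centroid`, `ensemble_scores`, `auc`) are hypothetical and introduced for illustration only.

```python
import random
from statistics import mean


def balanced_subsets(labels, n_subsets, seed=0):
    """Yield index lists holding every minority-class sample plus an
    equally sized random draw (without replacement) from the majority class.
    Assumption: random under-sampling, not the optimized selection of the paper."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_subsets):
        yield sorted(minority + rng.sample(majority, len(minority)))


def fit_centroid(X, y):
    """Stand-in base learner (NOT an SVM). Returns a scoring function:
    distance to the negative centroid minus distance to the positive centroid,
    so larger scores indicate the positive class."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    c_pos = centroid([x for x, t in zip(X, y) if t == 1])
    c_neg = centroid([x for x, t in zip(X, y) if t == 0])
    return lambda x: dist(x, c_neg) - dist(x, c_pos)


def ensemble_scores(X, y, X_test, n_subsets=5, seed=0):
    """Train one base learner per balanced subset and average their scores."""
    models = [
        fit_centroid([X[i] for i in idx], [y[i] for i in idx])
        for idx in balanced_subsets(y, n_subsets, seed)
    ]
    return [mean(m(x) for m in models) for x in X_test]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice one would replace `fit_centroid` with an SVM from a machine learning library and replace the random draw in `balanced_subsets` with a subset selection procedure; the averaging and AUC evaluation steps would stay the same in this sketch.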