Sample subset optimization for classifying imbalanced biological data

Authors:
Pengyi Yang;Zili Zhang;Bing B. Zhou;Albert Y. Zomaya
Affiliations:
School of Information Technologies, University of Sydney, NSW, Australia and NICTA, Australian Technology Park, Eveleigh, NSW, Australia and Centre for Distributed and High Performance Computing, ...;Faculty of Computer and Information Science, Southwest University, China and School of Information Technology, Deakin University, VIC, Australia;School of Information Technologies, University of Sydney, NSW, Australia and Centre for Distributed and High Performance Computing, University of Sydney, NSW, Australia;School of Information Technologies, University of Sydney, NSW, Australia and Centre for Distributed and High Performance Computing, University of Sydney, NSW, Australia
Venue:
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
Year:
2011

Abstract

Data in many biological problems are often complicated by an imbalanced class distribution; that is, the positive examples may be greatly outnumbered by the negative examples. Many classification algorithms, such as the support vector machine (SVM), are sensitive to data with an imbalanced class distribution and produce suboptimal classifications as a result. It is therefore desirable to compensate for the imbalance effect during model training to obtain more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than the sampling used in widely adopted ensemble approaches such as bagging and boosting.
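
The general idea of an ensemble trained on roughly balanced subsets and evaluated by AUC can be illustrated with a minimal sketch. It rests on loud assumptions: the abstract does not describe the subset optimization procedure, so plain random under-sampling of the negatives stands in for it here, and a nearest-centroid scorer stands in for the SVM base classifier. All function names (`balanced_subsets`, `centroid_scorer`, `ensemble_scores`, `auc`) are hypothetical and chosen for illustration only.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Build index lists holding every positive plus an equal number of
    randomly drawn negatives. ASSUMPTION: random under-sampling is used as a
    stand-in for the paper's sample subset optimization, and positives
    (label 1) are taken to be the minority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    return [pos + rng.sample(neg, k) for _ in range(n_subsets)]


def centroid_scorer(X, labels, idx):
    """Stand-in base learner (NOT an SVM): a higher score means the point is
    closer to the positive centroid than to the negative centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    cp = centroid([X[i] for i in idx if labels[i] == 1])
    cn = centroid([X[i] for i in idx if labels[i] == 0])
    return lambda x: dist2(x, cn) - dist2(x, cp)


def ensemble_scores(X_train, y_train, X_test, n_subsets=5, seed=0):
    """Train one base scorer per balanced subset and average their scores."""
    subsets = balanced_subsets(y_train, n_subsets, seed)
    scorers = [centroid_scorer(X_train, y_train, idx) for idx in subsets]
    return [sum(s(x) for s in scorers) / len(scorers) for x in X_test]


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data: 2 positives near (1, 1), 6 negatives near (0, 0).
X_train = [[1.0, 1.0], [0.9, 1.1], [0.0, 0.0], [0.1, -0.1],
           [0.0, 0.2], [-0.2, 0.0], [0.1, 0.1], [0.0, -0.1]]
y_train = [1, 1, 0, 0, 0, 0, 0, 0]
X_test = [[1.0, 1.0], [0.0, 0.0]]
y_test = [1, 0]
test_scores = ensemble_scores(X_train, y_train, X_test)
test_auc = auc(test_scores, y_test)
```

Averaging the scores of the base learners rather than their hard votes keeps a continuous ranking of the test points, which is what an AUC computation needs. Swapping in a real SVM and an optimized subset selection step would change only `centroid_scorer` and `balanced_subsets`.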