Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies

Authors:
Lee-Ing Tong;Yung-Chia Chang;Shan-Hui Lin
Affiliations:
Department of Industrial Engineering and Management, Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan;Department of Industrial Engineering and Management, Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan;Department of Industrial Engineering and Management, Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Citing 11

A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems
Machine Learning

Applying Machine Learning to Semiconductor Manufacturing
IEEE Expert: Intelligent Systems and Their Applications

The Case against Accuracy Estimation for Comparing Induction Algorithms
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning

Minority report in fraud detection: classification of skewed data
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets

Does cost-sensitive learning beat sampling for classifying rare classes?
UBDM '05 Proceedings of the 1st international workshop on Utility-based data mining

An introduction to ROC analysis
Pattern Recognition Letters - Special issue: ROC analysis in pattern recognition

Exploratory Under-Sampling for Class-Imbalance Learning
ICDM '06 Proceedings of the Sixth International Conference on Data Mining

Design and Analysis of Experiments

The class imbalance problem: A systematic study
Intelligent Data Analysis

The use of the area under the ROC curve in the evaluation of machine learning algorithms
Pattern Recognition

Development of a robust data mining method using CBFS and RSM
PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics

Cited 1

Application of the Box-Behnken design to the optimization of process parameters in foam cup molding
Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	12.05

Abstract

Imbalanced data are common in many machine learning applications. In an imbalanced data set, the number of instances in at least one class is significantly higher or lower than that in the other classes. Consequently, when a classification model is developed from imbalanced data, most classifiers are trained on an unequal number of instances in each class and thus fail to construct an effective model. Balancing the sample sizes of the classes with a re-sampling strategy is a conventional means of enhancing the effectiveness of a classification model for imbalanced data. Numerous studies have attempted to determine the appropriate re-sampling proportion for each class by trial and error (Barandela, Valdovinos, Sanchez, & Ferri, 2004; He, Han, & Wang, 2005; Japkowicz, 2000; McCarthy, Zabar, & Weiss, 2005), but finding the optimal strategy for each class may be infeasible with such a method. Therefore, this work proposes a novel analytical procedure, S-RSM, that determines the optimal re-sampling strategy based on design of experiments (DOE) and response surface methodologies (RSM). The proposed procedure can be used with any classifier; the C4.5 algorithm is adopted for illustration. The classification results are evaluated using the area under the receiver operating characteristic curve (AUC) as the performance measure. Desirable features of the AUC index include independence of the decision threshold and invariance to a priori class probabilities. Furthermore, experiments on five real-world data sets demonstrate that the classification model trained on data obtained from S-RSM achieves a higher AUC score than models trained using the oversampling or undersampling approach.
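The two AUC properties named in the abstract can be illustrated with a minimal sketch. It assumes the standard rank-based reading of the AUC (the probability that a randomly chosen minority instance is scored above a randomly chosen majority instance, with ties counted as one half). The function name `auc` and the score lists are made up for illustration and do not come from the paper. The invariance shown here concerns evaluating a fixed set of scores; it says nothing about how re-sampling the training data changes the model itself.

```python
from itertools import product


def auc(scores_pos, scores_neg):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as one half.
    No decision threshold appears anywhere in the computation."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores, for illustration only (not from the paper).
minority_scores = [0.9, 0.7, 0.4]
majority_scores = [0.8, 0.5, 0.3, 0.2, 0.1, 0.05]

base_auc = auc(minority_scores, majority_scores)

# Replicating every majority instance 10 times shifts the class ratio
# from 1:2 to 1:20, yet the pairwise ranking (and so the AUC) is unchanged.
skewed_auc = auc(minority_scores, majority_scores * 10)

print(base_auc, skewed_auc)
```

Accuracy computed at a fixed threshold would generally change under the same replication, which is one reason a threshold-free, prior-invariant measure is attractive when comparing re-sampling strategies.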