A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling

Authors:
Kamthorn Puntumapon;Kitsana Waiyamai
Affiliations:
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Thailand;Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Thailand
Venue:
PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II
Year:
2012


Abstract

One way to deal with class imbalance is to modify the class distribution of the data set. Synthetic over-sampling is a well-known method that does so by generating new synthetic minority data. The Synthetic Minority Over-sampling TEchnique (SMOTE) is a state-of-the-art synthetic over-sampling algorithm that generates new synthetic data along the line between each minority instance and its selected nearest neighbors. The advantage of SMOTE is that it makes decision regions larger and less specific to the original data. Its drawback, however, is the over-generalization problem, in which synthetic data are generated inside the majority class region. Over-generalization causes non-minority regions to be misclassified as minority class. To overcome the over-generalization problem, we propose an algorithm, called TRIM, that searches for a precise minority region while maintaining its generalization. TRIM iteratively filters out irrelevant majority data from the precise minority region. The output of the algorithm is multiple sets of seed minority data, and each individual set is used for generating new synthetic data. Experimental results show that TRIM significantly improves on state-of-the-art over-sampling algorithms in terms of F-measure and AUC. This suggests that over-generalization has a significant impact on the performance of synthetic over-sampling methods.
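
The SMOTE interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code and it is not TRIM, whose details the abstract does not give. The function names `k_nearest` and `smote_like`, the parameter defaults, and the toy data are all assumptions made for illustration, using only the Python standard library.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance), excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points on segments between minority points and their neighbours.

    Hypothetical helper for illustration only: it interpolates between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy two-dimensional minority class (made-up values).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = smote_like(minority, n_new=5)
```

Nothing in this sketch looks at majority data, so a majority instance lying between two minority neighbours would not stop a synthetic point from being placed next to it. That is the over-generalization behaviour the abstract describes.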