Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques

Authors: Putthiporn Thanathamathee; Chidchanok Lursinsap
Venue: Pattern Recognition Letters
Year: 2013

Abstract

Class imbalance is common in many applications, such as bioinformatics. With imbalanced data, predictions are usually biased towards the majority class. In several applications, however, prediction accuracy on the minority class is as important as accuracy on the majority class. Many techniques have been proposed to increase minority-class accuracy. These techniques are based on re-sampling, either over-sampling or under-sampling, during the training process. Those re-sampling techniques did not consider how the data are scattered in the space. In this paper, we propose a new technique based on the fact that the location of the separating function between any two sub-clusters in different classes is defined only by the boundary data of each sub-cluster. In addition, accuracy is measured only on the testing set. Our technique adapts the concept of bootstrapping to estimate a new region for each sub-cluster and to synthesize new boundary data. The new region is intended to cope with unseen testing data. All newly synthesized data were classified using the concept of the AdaBoost algorithm. Our results outperformed the other techniques under several performance evaluation functions.
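
The abstract does not say how the new region of a sub-cluster is estimated or how boundary data are placed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an axis-aligned box per sub-cluster, widens the box with the standard bootstrap bias correction (2 × observed value − bootstrap mean) applied to each feature's minimum and maximum, projects copies of minority points onto the faces of that box, and trains a small discrete AdaBoost with decision stumps. The function names `expanded_bounds`, `synthesize_boundary`, `train_adaboost`, and `predict`, and the toy data, are hypothetical.

```python
import math
import random

def expanded_bounds(points, n_boot, rng):
    """Bias-corrected bootstrap estimate of per-feature min/max (assumed scheme)."""
    dims = range(len(points[0]))
    lo_sum = [0.0 for _ in dims]
    hi_sum = [0.0 for _ in dims]
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]  # resample with replacement
        for d in dims:
            lo_sum[d] += min(p[d] for p in sample)
            hi_sum[d] += max(p[d] for p in sample)
    # A resampled min never falls below the observed min, so the correction
    # 2 * observed - bootstrap_mean pushes each bound outward.
    lo = [2 * min(p[d] for p in points) - lo_sum[d] / n_boot for d in dims]
    hi = [2 * max(p[d] for p in points) - hi_sum[d] / n_boot for d in dims]
    return lo, hi

def synthesize_boundary(points, lo, hi, n_new, rng):
    """Copy random points and move one coordinate onto a face of the box."""
    new_points = []
    for _ in range(n_new):
        p = list(rng.choice(points))
        d = rng.randrange(len(p))
        p[d] = lo[d] if rng.random() < 0.5 else hi[d]
        new_points.append(tuple(p))
    return new_points

def stump(x, d, thr, sign):
    """Axis-aligned decision stump returning +1 or -1."""
    return sign if x[d] <= thr else -sign

def train_adaboost(X, y, rounds):
    """Discrete AdaBoost over decision stumps; labels must be +1 / -1."""
    w = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        best = None
        for d in range(len(X[0])):
            for thr in sorted({x[d] for x in X}):
                for sign in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump(xi, d, thr, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, d, thr, sign)
        err, d, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, d, thr, sign))
        # Increase the weight of misclassified points, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(xi, d, thr, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * stump(x, d, t, s) for a, d, t, s in model)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): 40 majority points (-1) and 8 minority points (+1).
rng = random.Random(42)
majority = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(40)]
minority = [(rng.uniform(2.5, 3.5), rng.uniform(2.5, 3.5)) for _ in range(8)]

lo, hi = expanded_bounds(minority, 200, rng)
synthetic = synthesize_boundary(minority, lo, hi, 16, rng)

X = majority + minority + synthetic
y = [-1] * len(majority) + [1] * (len(minority) + len(synthetic))
model = train_adaboost(X, y, rounds=5)
```

In this sketch the synthetic points only rebalance the training set and mark the outer edge of the estimated minority region. A fuller treatment would first split each class into sub-clusters and would evaluate the classifier on a held-out testing set, as the abstract emphasizes.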