Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques

  • Authors:
  • Putthiporn Thanathamathee;Chidchanok Lursinsap

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2013

Abstract

The problem of imbalanced data between classes arises in many applications, such as bioinformatics. With imbalanced data, prediction is usually biased towards the majority class. However, in several applications, prediction accuracy on the minority class is as important as on the majority class. Many techniques have been proposed to increase accuracy on the minority class. These techniques are based on re-sampling, either over-sampling or under-sampling, during the training process. Those re-sampling techniques did not consider how the data are scattered in the space. In this paper, we propose a new technique based on the fact that the location of the separating function between any two sub-clusters in different classes is defined only by the boundary data of each sub-cluster. In addition, accuracy is measured only on the testing set. Our technique adapts the concept of bootstrapping to estimate a new region for each sub-cluster and to synthesize new boundary data. The new region is intended to cope with unseen testing data. All newly synthesized data were classified using the concept of the AdaBoost algorithm. Our results outperformed the other techniques under several performance evaluation functions.
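
The abstract gives no algorithmic detail, so the following is only a minimal illustrative sketch of the general idea of bootstrapping a sub-cluster's region and placing synthetic points on its boundary. It is not the authors' procedure. The function names `bootstrap_region` and `synthesize_boundary`, the spherical region model (centroid plus radius), and all parameter values are assumptions made for illustration. The AdaBoost classification step is not shown.

```python
import math
import random


def bootstrap_region(points, n_boot=200, seed=0):
    """Bootstrap estimate of a sub-cluster's centroid and boundary radius.

    Assumption: the region is modelled as a sphere; the paper may use a
    different region model.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    centroids, radii = [], []
    for _ in range(n_boot):
        # Resample the sub-cluster with replacement.
        sample = [points[rng.randrange(n)] for _ in range(n)]
        c = [sum(p[d] for p in sample) / n for d in range(dim)]
        # Radius of this replicate: distance to its farthest member.
        radii.append(max(math.dist(p, c) for p in sample))
        centroids.append(c)
    # Average the replicate estimates.
    centroid = [sum(c[d] for c in centroids) / n_boot for d in range(dim)]
    radius = sum(radii) / n_boot
    return centroid, radius


def synthesize_boundary(centroid, radius, n_new, seed=0):
    """Place n_new synthetic points on the estimated spherical boundary."""
    rng = random.Random(seed)
    dim = len(centroid)
    new_points = []
    while len(new_points) < n_new:
        # A normalised Gaussian vector gives a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        if norm == 0.0:
            continue  # extremely unlikely; draw again
        new_points.append(
            [c + radius * v / norm for c, v in zip(centroid, direction)]
        )
    return new_points


# Hypothetical minority sub-cluster in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
centroid, radius = bootstrap_region(minority)
synthetic = synthesize_boundary(centroid, radius, n_new=10)
```

Under these assumptions, the synthetic points could be appended to the minority class before training a boosted classifier. How the paper actually identifies sub-clusters, defines their regions, and combines the result with AdaBoost is not stated in the abstract.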