On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed

  • Authors:
  • Victoria López;Alberto Fernández;Francisco Herrera

  • Affiliations:
  • -;-;-

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2014

Quantified Score

Hi-index 0.07

Visualization

Abstract

In the field of Data Mining, the estimation of the quality of the learned models is a key step in order to select the most appropriate tool for the problem to be solved. Traditionally, a k-fold validation technique has been carried out so that there is a certain degree of independency among the results for the different partitions. In this way, the highest average performance will be obtained by the most robust approach. However, applying a ''random'' division of the instances over the folds may result in a problem known as dataset shift, which consists in having a different data distribution between the training and test folds. In classification with imbalanced datasets, in which the number of instances of one class is much lower than the other class, this problem is more severe. The misclassification of minority class instances due to an incorrect learning of the real boundaries caused by a not well fitted data distribution, truly affects the measures of performance in this scenario. Regarding this fact, we propose the use of a specific validation technique for the partitioning of the data, known as ''Distribution optimally balanced stratified cross-validation'' to avoid this harmful situation in the presence of imbalance. This methodology makes the decision of placing close-by samples on different folds, so that each partition will end up with enough representatives of every region. We have selected a wide number of imbalanced datasets from KEEL dataset repository for our study, using several learning techniques from different paradigms, thus making the conclusions extracted to be independent of the underlying classifier. The analysis of the results has been carried out by means of the proper statistical study, which shows the goodness of this approach for dealing with imbalanced data.