Instance-based data reduction for improved identification of difficult small classes

  • Authors:
  • Jorma Laurikkala

  • Affiliations:
  • Department of Computer and Information Sciences, University of Tampere, P.O. Box 607, FIN-33014 University of Tampere, Finland. Tel.: +358 3 2157564/ Fax: +358 3 2156070/ E-mail: Jorma.Laurikkala@ ...

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2002

Abstract

We studied three methods for improving the identification of small classes that are also difficult to classify. Each method balances an imbalanced class distribution through data reduction. The new method, the neighborhood cleaning rule (NCL), outperformed simple random sampling within classes and the one-sided selection method in experiments with ten real-world data sets. All reduction methods clearly improved the identification of small classes (20-30%) in terms of the true-positive rates of the three-nearest neighbor method and the C4.5 decision tree generator, but the differences between the methods in this respect were insignificant. However, the significant differences in accuracies, true-positive rates, and true-negative rates obtained from the reduced data were in favor of our method. The results suggest that the NCL rule is a useful method for improving the modeling of difficult small classes, as well as for building classifiers that identify these classes in real-world data, which frequently have an imbalanced class distribution.
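To make the evaluation terms concrete, the following is a minimal sketch of the simplest of the three approaches, random sampling within classes, together with the true-positive and true-negative rates used to compare the methods. It is not the paper's implementation. The function names, the choice of reducing every larger class to the size of the small class, and the toy labels are assumptions made only for illustration. The NCL rule and one-sided selection are not reproduced here because the abstract does not describe their steps.

```python
import random
from collections import Counter, defaultdict


def undersample_within_classes(labels, small_class, seed=0):
    """Return sorted indices kept after random reduction of the larger classes.

    Assumption for illustration: every class larger than `small_class`
    is randomly reduced to the size of `small_class`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    target = len(by_class[small_class])
    kept = []
    for idx in by_class.values():
        # Classes already at or below the target size are kept whole.
        kept.extend(idx if len(idx) <= target else rng.sample(idx, target))
    return sorted(kept)


def tpr_tnr(y_true, y_pred, positive):
    """True-positive and true-negative rates, treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr


# Toy data: 8 examples of a large class "a" and 2 of a small class "b".
labels = ["a"] * 8 + ["b"] * 2
kept = undersample_within_classes(labels, small_class="b")
balanced_counts = Counter(labels[i] for i in kept)

# Toy predictions: one of the two small-class examples is missed.
rates = tpr_tnr(["b", "b", "a", "a"], ["b", "a", "a", "a"], positive="b")
```

In this sketch the small class is the positive class, so the true-positive rate measures how many small-class examples a classifier identifies, and the true-negative rate measures how many examples of the other classes it leaves correctly unlabeled as the small class.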