Does cost-sensitive learning beat sampling for classifying rare classes?

  • Authors: Kate McCarthy; Bibi Zabar; Gary Weiss
  • Affiliations: Fordham University, Bronx, NY (all three authors)
  • Venue: UBDM '05: Proceedings of the 1st International Workshop on Utility-Based Data Mining
  • Year: 2005

Abstract

A highly skewed class distribution usually causes the learned classifier to predict the majority class much more often than the minority class. This is a consequence of the fact that most classifiers are designed to maximize accuracy. In many instances, such as medical diagnosis, the minority class is the class of primary interest, and hence this classification behavior is unacceptable. In this paper, we compare two basic strategies for dealing with data that has a skewed class distribution and non-uniform misclassification costs. One strategy is based on cost-sensitive learning, while the other employs sampling to create a more balanced class distribution in the training set. We compare two sampling techniques, up-sampling and down-sampling, to the cost-sensitive learning approach. The purpose of this paper is to determine which technique produces the best overall classifier, and under what circumstances.
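For readers unfamiliar with the three techniques named in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a binary problem with labels 0 (majority) and 1 (minority); the function names `down_sample`, `up_sample`, and `min_expected_cost_label` and the cost parameters `cost_fn` and `cost_fp` are hypothetical names chosen for this example. Sampling here balances the classes exactly, which is one common choice among several, and the cost-sensitive rule is the standard minimum-expected-cost decision applied to an estimated minority-class probability.

```python
import random
from collections import Counter


def _group_by_class(examples, labels):
    """Group examples by their class label."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def down_sample(examples, labels, seed=0):
    """Randomly discard examples until every class has the size of the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = min(len(xs) for xs in by_class.values())
    return [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]


def up_sample(examples, labels, seed=0):
    """Duplicate randomly chosen examples until every class has the size of the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


def min_expected_cost_label(p_minority, cost_fn, cost_fp):
    """Cost-sensitive decision (assumed form): pick the label with lower expected cost.

    Predicting 0 risks a false negative: expected cost p_minority * cost_fn.
    Predicting 1 risks a false positive: expected cost (1 - p_minority) * cost_fp.
    """
    return 1 if p_minority * cost_fn > (1 - p_minority) * cost_fp else 0


if __name__ == "__main__":
    examples = list(range(100))
    labels = [0] * 90 + [1] * 10  # 9:1 skew toward the majority class
    print(Counter(y for _, y in down_sample(examples, labels)))
    print(Counter(y for _, y in up_sample(examples, labels)))
    print(min_expected_cost_label(0.2, cost_fn=10, cost_fp=1))
```

With a 10:1 cost ratio, the cost-sensitive rule predicts the minority class even when its estimated probability is only 0.2, whereas with equal costs it would predict the majority class. The sampling functions instead change the training data so that an accuracy-maximizing learner sees the classes in equal proportion.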