Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains

  • Authors:
  • Sareewan Dendamrongvit
  • Miroslav Kubat

  • Affiliations:
  • Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL (both authors)

  • Venue:
  • PAKDD'09: Proceedings of the 13th Pacific-Asia International Conference on Knowledge Discovery and Data Mining: New Frontiers in Applied Data Mining
  • Year:
  • 2009


Abstract

Text categorization is an important application domain of multi-label classification, where each document can simultaneously belong to more than one class. The most common approach is to handle multi-label examples by inducing a separate binary classifier for each class and then using these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows that addressing this imbalance with majority-class undersampling can indeed improve classification performance as measured by criteria common in text categorization: macro- and micro-averaged precision, recall, and F1. We also show how a slight modification of an older undersampling technique further improves the results.
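To make the setting concrete, the sketch below shows the generic scheme the abstract describes: decomposing a multi-label problem into one binary training set per class, then randomly undersampling the majority (negative) class in each. This is a minimal illustration of plain random undersampling, not the paper's specific modified technique; all function names and the target positive/negative ratio are assumptions for the example.

```python
import random

def undersample_majority(X, y, ratio=1.0, seed=0):
    """Randomly drop negative examples so that the number of
    negatives is at most ratio * (number of positives).
    Plain random undersampling; an illustrative sketch, not the
    paper's modified method."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    rng = random.Random(seed)
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    idx = sorted(pos + keep_neg)
    return [X[i] for i in idx], [y[i] for i in idx]

def binary_training_sets(X, multilabels, classes):
    """One-vs-rest decomposition: for each class, build a binary
    training set (positive = document carries that label) and
    undersample its majority class."""
    sets = {}
    for c in classes:
        y = [1 if c in labels else 0 for labels in multilabels]
        sets[c] = undersample_majority(X, y)
    return sets
```

In this scheme each per-class training set would then be used to induce an independent binary classifier, and the classifiers are applied in parallel at prediction time, assigning a document every label whose classifier fires.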