Taking advantage of the web for text classification with imbalanced classes

  • Authors:
  • Rafael Guzmán-Cabrera;Manuel Montes-y-Gómez;Paolo Rosso;Luis Villaseñor-Pineda

  • Affiliations:
  • FIMEE, Universidad de Guanajuato, Mexico and DSIC, Universidad Politécnica de Valencia, Spain;LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;DSIC, Universidad Politécnica de Valencia, Spain;LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico

  • Venue:
  • MICAI'07: Proceedings of the 6th Mexican International Conference on Advances in Artificial Intelligence
  • Year:
  • 2007

Abstract

Supervised approaches to text classification commonly require high-quality training data to construct an accurate classifier. Unfortunately, in many real-world applications the training sets are extremely small and have imbalanced class distributions. To address these problems, this paper proposes a novel approach to text classification that combines under-sampling with a semi-supervised learning method. In particular, the proposed semi-supervised method is specially suited to working with very few training examples and relies on the automatic extraction of untagged data from the Web. Experimental results on a subset of the Reuters-21578 text collection indicate that the proposed approach can be a practical solution to the class-imbalance problem, since it achieves very good results using very small training sets.
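
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: random under-sampling to balance the classes, followed by a self-training loop that labels untagged text and adds the most confident examples back to the training set. Everything here is an assumption made for illustration: the function names (`undersample`, `train_nb`, `self_train`), the naive Bayes classifier, the margin-based confidence, and the `per_class` and `rounds` parameters. The untagged pool is passed in as a plain list; retrieving such snippets from the Web is not shown. At least two classes are assumed.

```python
import math
import random
from collections import Counter, defaultdict


def undersample(docs, labels, seed=0):
    """Randomly drop examples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    n = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in sorted(by_class.items()):
        balanced.extend((doc, label) for doc in rng.sample(items, n))
    return balanced


def train_nb(pairs):
    """Fit multinomial naive Bayes (add-one smoothing); return a log-score function."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for doc, label in pairs:
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    total = sum(class_counts.values())

    def log_scores(doc):
        scores = {}
        for label, count in class_counts.items():
            denom = sum(word_counts[label].values()) + len(vocab)
            score = math.log(count / total)
            for tok in doc.lower().split():
                if tok in vocab:  # words never seen in training are ignored
                    score += math.log((word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return scores

    return log_scores


def self_train(labeled, unlabeled, per_class=1, rounds=3):
    """Hypothetical self-training loop: each round, add the `per_class` most
    confidently labeled pool documents for every predicted class."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        score = train_nb(labeled)
        candidates = defaultdict(list)
        for doc in pool:
            s = score(doc)
            ranked = sorted(s, key=s.get, reverse=True)
            margin = s[ranked[0]] - s[ranked[1]]  # gap between top two classes
            candidates[ranked[0]].append((margin, doc))
        for label, items in candidates.items():
            items.sort(reverse=True)
            for _, doc in items[:per_class]:
                labeled.append((doc, label))
                pool.remove(doc)
    return train_nb(labeled), labeled
```

Adding the same number of examples per class in each round keeps the growing training set balanced, which is one plausible way to stop self-training from reintroducing the imbalance that under-sampling removed; whether the paper does this is not stated in the abstract.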