Taking advantage of the web for text classification with imbalanced classes

Authors:
Rafael Guzmán-Cabrera;Manuel Montes-y-Gómez;Paolo Rosso;Luis Villaseñor-Pineda
Affiliations:
FIMEE, Universidad de Guanajuato, Mexico and DSIC, Universidad Politécnica de Valencia, Spain;LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;DSIC, Universidad Politécnica de Valencia, Spain;LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
Venue:
MICAI'07 Proceedings of the 6th Mexican International Conference on Advances in Artificial Intelligence
Year:
2007

Abstract

A drawback of supervised approaches to text classification is that they typically require high-quality training data to build an accurate classifier. Unfortunately, in many real-world applications the training sets are extremely small and have imbalanced class distributions. To address these problems, this paper proposes a novel approach to text classification that combines under-sampling with a semi-supervised learning method. In particular, the proposed semi-supervised method is especially suited to working with very few training examples and incorporates the automatic extraction of untagged data from the Web. Experimental results on a subset of the Reuters-21578 text collection indicate that the proposed approach can be a practical solution to the class-imbalance problem, since it achieves very good results with very small training sets.
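
The general idea of combining under-sampling with semi-supervised learning can be illustrated with a minimal sketch. The abstract does not say which classifier is used, how Web snippets are retrieved, or how untagged examples are selected, so everything below is an assumption: a small multinomial naive Bayes model, a plain self-training loop that adds the most confidently labeled snippet in each round, and hard-coded strings standing in for text downloaded from the Web. The function names (`undersample`, `train_nb`, `predict_nb`, `self_train`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter, defaultdict

def undersample(docs, labels, seed=0):
    """Randomly drop examples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    n = min(len(v) for v in by_class.values())
    balanced = [(d, c) for c, v in by_class.items() for d in rng.sample(v, n)]
    return [d for d, _ in balanced], [c for _, c in balanced]

def train_nb(docs, labels):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    counts = defaultdict(Counter)
    priors = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc.lower().split())
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def predict_nb(model, doc):
    """Return (best_label, margin); margin is the log-score gap to the runner-up."""
    counts, priors, vocab = model
    total = sum(priors.values())
    scores = {}
    for label in priors:
        denom = sum(counts[label].values()) + len(vocab)  # Laplace smoothing
        score = math.log(priors[label] / total)
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][w] + 1) / denom)
        scores[label] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else float("inf")
    return ranked[0], margin

def self_train(docs, labels, unlabeled, rounds=3, per_round=1):
    """Self-training: each round, add the most confidently labeled untagged docs."""
    docs, labels, pool = list(docs), list(labels), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train_nb(docs, labels)
        scored = sorted(((predict_nb(model, d), d) for d in pool),
                        key=lambda item: item[0][1], reverse=True)
        for (label, _), d in scored[:per_round]:
            docs.append(d)
            labels.append(label)
            pool.remove(d)
    return train_nb(docs, labels)

# Toy, made-up data: four "grain" documents versus one "crude" document.
train_docs = ["wheat harvest grain", "grain export wheat", "wheat crop farm",
              "corn grain farm", "oil crude barrel"]
train_labels = ["grain", "grain", "grain", "grain", "crude"]
# Stand-ins for untagged snippets that would be retrieved from the Web.
web_snippets = ["crude oil prices barrel opec", "wheat grain harvest tonnes"]

bal_docs, bal_labels = undersample(train_docs, train_labels)
model = self_train(bal_docs, bal_labels, web_snippets, rounds=2)
label, _ = predict_nb(model, "opec barrel prices")
print(label)
```

In this sketch the query "opec barrel prices" shares only one word with the original labeled data; the other two words enter the model through the self-labeled snippet, which is the role the untagged Web data is meant to play when labeled examples are scarce.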