Summarization as feature selection for document categorization on small datasets

  • Authors:
  • Emmanuel Anguiano-Hernández;Luis Villaseñor-Pineda;Manuel Montes-y-Gómez;Paolo Rosso

  • Affiliations:
  • Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Natural Language Engineering Lab, ELiRF, Department of Information Systems and Computation, Polytechnic University of Valencia, Spain

  • Venue:
  • IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
  • Year:
  • 2010

Abstract

The most common feature selection techniques for document categorization are supervised and require large amounts of training data to accurately capture the descriptive and discriminative information of the defined categories. Since training sets are extremely small in many classification tasks, this paper explores the use of unsupervised extractive summarization as a feature selection technique for document categorization. Our experiments with training sets of different sizes indicate that text summarization is a competitive approach to feature selection and show that it is well suited to situations with small training sets, where it could clearly outperform the traditional information gain technique.
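
To make the idea in the abstract concrete, here is a minimal Python sketch of summarization used as unsupervised feature selection. It is an illustration under stated assumptions, not the authors' method: the abstract does not say which summarizer was used, so the sketch swaps in a simple frequency-based sentence scorer. The function names (`summarize`, `select_features`, `vectorize`) and the `ratio` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on ., ! and ? (an assumption; any sentence tokenizer would do).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, ratio=0.5):
    """Extractive summary: keep the top `ratio` of sentences.

    Assumed scoring: the mean in-document frequency of a sentence's words.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, math.ceil(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:keep])]


def select_features(documents, ratio=0.5):
    """Vocabulary = every word that survives in some summary. No labels are used."""
    vocab = set()
    for doc in documents:
        for sentence in summarize(doc, ratio):
            vocab.update(tokenize(sentence))
    return vocab


def vectorize(text, vocab):
    # Bag-of-words counts restricted to the selected vocabulary.
    return Counter(t for t in tokenize(text) if t in vocab)


doc = "Cats chase mice. Cats like cats. The weather was mild."
vocab = select_features([doc], ratio=0.3)
print(sorted(vocab))
```

Because the selection step never looks at category labels, it behaves the same on a tiny training set as on a large one. Supervised criteria such as information gain, by contrast, need enough labeled examples to estimate term-category statistics reliably.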