Using typical testors for feature selection in text categorization

  • Authors:
  • Aurora Pons-Porrata;Reynaldo Gil-García;Rafael Berlanga-Llavori

  • Affiliations:
  • Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba;Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba;Computer Science, Universitat Jaume I, Castellón, Spain

  • Venue:
  • CIARP'07: Proceedings of the 12th Iberoamerican Congress on Pattern Recognition: Progress in Pattern Recognition, Image Analysis and Applications
  • Year:
  • 2007

Abstract

A major difficulty of text categorization problems is the high dimensionality of the feature space. Thus, feature selection is often performed in order to increase both the efficiency and effectiveness of the classification. In this paper, we propose a feature selection method based on Testor Theory. This criterion takes into account inter-feature relationships. We experimentally compared our method with the widely used information gain using two well-known classification algorithms: k-nearest neighbour and Support Vector Machine. Two benchmark text collections were chosen as the testbeds: Reuters-21578 and Reuters Corpus Version 1 (RCV1-v2). We found that our method consistently outperformed information gain for both classifiers and both data collections, especially when aggressive feature selection is carried out.
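
The abstract does not spell out how testors are computed, so the following is only a minimal sketch of the general notion from Testor Theory, not the authors' selection method. It assumes Boolean term-presence features compared by strict equality: a testor is a feature subset on which no two documents from different classes look identical, and a typical testor is a testor with no proper subset that is also a testor. The function names and the toy data are hypothetical, and the brute-force search is exponential in the number of features, so it is suitable only for illustration.

```python
from itertools import combinations


def is_testor(rows, labels, features):
    """True if no two rows from different classes agree on every selected feature."""
    for (a, label_a), (b, label_b) in combinations(zip(rows, labels), 2):
        if label_a != label_b and all(a[f] == b[f] for f in features):
            return False
    return True


def typical_testors(rows, labels):
    """Enumerate irreducible (typical) testors by brute force, smallest first."""
    n_features = len(rows[0])
    found = []
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            # A superset of an already-found testor is reducible, so skip it.
            if any(set(t) <= set(subset) for t in found):
                continue
            if is_testor(rows, labels, subset):
                found.append(subset)
    return found


# Toy collection (assumed data): four documents, three Boolean term features.
rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
labels = ["a", "a", "b", "b"]
print(typical_testors(rows, labels))
```

In this toy example, feature 0 alone separates the two classes, while features 1 and 2 do so only together. That joint behaviour is the kind of inter-feature relationship a per-feature score such as information gain does not capture.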