Feature selection strategies for text categorization

  • Authors:
  • Pascal Soucy; Guy W. Mineau

  • Affiliations:
  • Copernic Research, Copernic Inc., Québec, Canada and Department of Computer Science, Université Laval, Québec, Canada; Department of Computer Science, Université Laval, Québec, Canada

  • Venue:
  • AI'03: Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence
  • Year:
  • 2003

Abstract

Feature selection is an important research issue in text categorization because thousands of features are often involved, even with the simplest document representation model, the so-called bag-of-words. Among the many approaches to feature selection, an important one uses a scoring function to rank features and filter out the low-ranked ones. Many such functions are widely used in text categorization. In past feature selection studies, most researchers have focused on comparing these scoring functions in terms of the accuracy achieved. For any given function, however, many selection strategies can be applied to produce the resulting feature set. In this paper, we compare several such strategies and propose a new one. Tests were conducted to compare five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. The results make it possible to better understand which strategies suit particular classification settings.
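
To make the distinction between a scoring function and a selection strategy concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not name the four scoring functions or the five strategies, so the chi-square statistic and the two strategies shown (a global ranking by maximum score, and a per-category round robin) are assumptions chosen for illustration. The function names and the toy documents are hypothetical as well.

```python
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def score_terms(docs):
    """Score every term against every category.

    docs is a list of (set_of_terms, category) pairs (bag-of-words,
    presence only). Returns {category: {term: score}}.
    """
    n_docs = len(docs)
    cat_count = Counter(cat for _, cat in docs)
    term_count = Counter()
    term_cat_count = Counter()
    for terms, cat in docs:
        for t in set(terms):
            term_count[t] += 1
            term_cat_count[(t, cat)] += 1
    scores = defaultdict(dict)
    for cat in cat_count:
        for t in term_count:
            n11 = term_cat_count[(t, cat)]      # term present, in category
            n10 = term_count[t] - n11           # term present, other category
            n01 = cat_count[cat] - n11          # term absent, in category
            n00 = n_docs - n11 - n10 - n01      # term absent, other category
            scores[cat][t] = chi_square(n11, n10, n01, n00)
    return scores


def select_global_max(scores, k):
    """Illustrative strategy 1: rank terms by their best score over all
    categories and keep the top k (ties broken alphabetically)."""
    best = {}
    for cat_scores in scores.values():
        for t, s in cat_scores.items():
            best[t] = max(best.get(t, 0.0), s)
    return sorted(best, key=lambda t: (-best[t], t))[:k]


def select_round_robin(scores, k):
    """Illustrative strategy 2: visit categories in turn, each time taking
    that category's best not-yet-selected term, until k terms are chosen."""
    rankings = {
        cat: sorted(s, key=lambda t: (-s[t], t)) for cat, s in scores.items()
    }
    positions = {cat: 0 for cat in rankings}
    selected, seen = [], set()
    while len(selected) < k:
        progressed = False
        for cat in sorted(rankings):
            ranking, i = rankings[cat], positions[cat]
            while i < len(ranking) and ranking[i] in seen:
                i += 1
            if i < len(ranking):
                selected.append(ranking[i])
                seen.add(ranking[i])
                i += 1
                progressed = True
            positions[cat] = i
            if len(selected) == k:
                break
        if not progressed:  # every ranking is exhausted
            break
    return selected


# Hypothetical toy corpus: each document is a set of terms plus a label.
toy_docs = [
    ({"goal", "match", "team"}, "sport"),
    ({"goal", "team", "coach"}, "sport"),
    ({"vote", "party", "team"}, "politics"),
    ({"vote", "election"}, "politics"),
]
toy_scores = score_terms(toy_docs)
print(select_global_max(toy_scores, 2))
print(select_round_robin(toy_scores, 2))
```

Both functions consume the same score table; only the rule for turning scores into a feature set differs, which is the kind of choice the paper evaluates. On a two-category toy corpus like this one the chi-square scores are symmetric across categories, so the two strategies would be expected to diverge only with more categories or unbalanced data.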