Categorical proportional difference: a feature selection method for text categorization

  • Authors: Mondelle Simeon; Robert Hilderman

  • Affiliations: University of Regina, Regina, Saskatchewan, Canada (both authors)

  • Venue: AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
  • Year: 2008

Abstract

Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: χ², information gain, document frequency, mutual information, odds ratio, and simplified χ². Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks.
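
The abstract describes CPD only as a ratio over two document counts and does not spell out the formula. The sketch below assumes one plausible form, CPD(w, c) = (A - B) / (A + B), where A is the number of documents in category c that contain word w and B is the number of documents in all other categories that contain w. Under that assumption the score ranges from -1 to 1, and a value of 1 means the word occurs only in documents of category c. The function name `cpd_scores` and the toy corpus are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def cpd_scores(docs):
    """Compute an assumed CPD score for every (word, category) pair.

    docs: iterable of (category, words) pairs, one per document.
    Assumed formula (not stated in the abstract): (A - B) / (A + B).
    """
    # word -> category -> number of documents of that category containing the word
    per_category = defaultdict(lambda: defaultdict(int))
    # word -> number of documents (any category) containing the word
    total = defaultdict(int)

    for category, words in docs:
        for word in set(words):  # count each word once per document
            per_category[word][category] += 1
            total[word] += 1

    scores = {}
    for word, counts in per_category.items():
        for category, a in counts.items():
            b = total[word] - a  # documents from other categories containing the word
            scores[(word, category)] = (a - b) / (a + b)
    return scores


# Toy corpus, invented for illustration only.
toy_docs = [
    ("sport", ["goal", "match"]),
    ("sport", ["goal", "team"]),
    ("tech", ["chip", "team"]),
    ("tech", ["chip", "goal"]),
]
scores = cpd_scores(toy_docs)
```

In this toy corpus, "chip" appears only in "tech" documents and scores 1.0 for that category, while "team" is split evenly between the two categories and scores 0.0. A feature selector built on this sketch would keep the words whose best per-category score exceeds a chosen threshold; the threshold and the aggregation across categories are design choices the abstract does not specify.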