Categorical Document Frequency Based Feature Selection for Text Categorization

  • Authors:
  • Zhilong Zhen;Haijuan Wang;Lixin Han;Zhan Shi

  • Venue:
  • ICM '11 Proceedings of the 2011 International Conference of Information Technology, Computer Engineering and Management Sciences - Volume 02
  • Year:
  • 2011

Abstract

Effective feature selection is essential for improving the accuracy and efficiency of text categorization. Motivated by document frequency, we propose a new filter-based feature selection approach called categorical document frequency. The categorical document frequency describes the distribution of a term over the categories, and the variance of that distribution reflects the term's contribution to categorization. Experiments are carried out on the Reuters-21578 standard text corpus. The results show that the categorization performance of the proposed approach is comparable to or better than that of information gain and the chi-square statistic. In addition, its computational cost is lower than that of information gain and chi-square, so it is also well suited to processing large-scale text data.
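
The abstract does not give the exact formula, so the following is only a minimal sketch of the idea under stated assumptions: the categorical document frequency of a term in a category is taken to be the fraction of that category's documents containing the term, and a term is scored by the population variance of those fractions across categories. The function names (`categorical_document_frequency`, `variance_scores`, `select_top_k`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import pvariance


def categorical_document_frequency(docs, labels):
    """Return {term: {category: fraction of the category's docs containing term}}.

    Assumption: per-category normalisation; the paper may define it differently.
    """
    cat_sizes = defaultdict(int)                    # documents per category
    counts = defaultdict(lambda: defaultdict(int))  # term -> category -> doc count
    for tokens, label in zip(docs, labels):
        cat_sizes[label] += 1
        for term in set(tokens):                    # count each term once per document
            counts[term][label] += 1
    return {
        term: {c: per_cat.get(c, 0) / n for c, n in cat_sizes.items()}
        for term, per_cat in counts.items()
    }


def variance_scores(cdf):
    """Score each term by the variance of its frequencies across categories."""
    return {term: pvariance(list(freqs.values())) for term, freqs in cdf.items()}


def select_top_k(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = variance_scores(categorical_document_frequency(docs, labels))
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [["wheat", "price"], ["wheat", "export"],
            ["oil", "price"], ["oil", "barrel"]]
    labels = ["grain", "grain", "crude", "crude"]
    print(select_top_k(docs, labels, 2))
```

Under these assumptions, a term spread evenly over all categories (here `price`) gets a variance of zero and is discarded first, while a term concentrated in one category scores highest. Scoring needs only one pass over the corpus to collect counts, which is consistent with the low computational cost the abstract claims.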