Improved categorical distribution difference feature selection for Chinese document categorization

  • Authors:
  • Qiang Li; Liang He; Xin Lin

  • Affiliations:
  • East China Normal University, Shanghai; East China Normal University, Shanghai; East China Normal University, Shanghai

  • Venue:
  • Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
  • Year:
  • 2014

Abstract

Feature selection is an important step in document classification: it chooses a subset of features relevant to a particular application. First, based on the categorical document frequency probability (CDFP), the CDFP_VM criterion was designed for feature selection. Second, a maximum conditional distribution factor was proposed to further improve the CDFP_VM criterion. The method is advantageous when a small number of features is selected, especially for classes with few training documents. It keeps the best features without favoring terms of either high or low document frequency (DF), and thus improves the final performance of the document categorization system. We perform experiments on the standard Fudan Chinese corpus and a selected Sogou corpus as balanced and unbalanced corpora, respectively. The experimental results demonstrate the effectiveness of the proposed feature selection method in Chinese document categorization.