Improved categorical distribution difference feature selection for Chinese document categorization
Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
Effective feature selection methods are essential for improving the accuracy and efficiency of text categorization. Motivated by document frequency, we propose a new filter-based feature selection approach called categorical document frequency. The categorical document frequency of a term describes how the term is distributed over the categories. Mathematically, the variance of a term's categorical document frequency across categories reflects the term's contribution to categorization. Experiments are carried out on the Reuters-21578 standard text corpus. The results show that the categorization performance of the proposed approach is similar to or better than that of information gain and the chi-square statistic. In addition, the computational cost of this approach is lower than that of information gain and chi-square, so it is also well suited to processing large-scale text data.
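The abstract does not give the exact formula, so the following is only a minimal sketch of the idea under stated assumptions: the categorical document frequency of a term in a category is taken to be the fraction of that category's documents containing the term, and a term's score is the population variance of those fractions across categories. The function names `categorical_df_variance` and `select_top_k`, the normalization, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import pvariance


def categorical_df_variance(docs, labels):
    """Score each term by the variance of its per-category document frequency.

    Assumption (not from the paper): the per-category document frequency is
    normalized by the number of documents in that category.
    """
    categories = sorted(set(labels))
    docs_per_category = {c: 0 for c in categories}
    # df[term][category] = number of documents in category containing term
    df = defaultdict(lambda: defaultdict(int))

    for tokens, label in zip(docs, labels):
        docs_per_category[label] += 1
        for term in set(tokens):  # count each term once per document
            df[term][label] += 1

    scores = {}
    for term, counts in df.items():
        rates = [counts.get(c, 0) / docs_per_category[c] for c in categories]
        scores[term] = pvariance(rates)
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (made up for illustration only).
docs = [
    ["oil", "price", "the"],
    ["oil", "barrel", "the"],
    ["wheat", "crop", "the"],
    ["wheat", "harvest", "the"],
]
labels = ["crude", "crude", "grain", "grain"]

scores = categorical_df_variance(docs, labels)
top_terms = select_top_k(scores, 2)
```

In this toy corpus, a term that appears in every document of every category gets a variance of zero and is ranked last, while a term concentrated in one category gets a high score. The scoring needs only one pass over the corpus plus a variance per term, which is consistent with the abstract's claim of low computational cost, although the paper's actual procedure may differ.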