A Study of \chi^2-test for Text Categorization

  • Authors: Yao-Tsung Chen; Meng Chang Chen
  • Affiliations: National Penghu University, Taiwan; Academia Sinica, Taiwan
  • Venue: WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
  • Year: 2006

Abstract

In this paper, we propose the \chi^2-classifier, which uses the \chi^2-test to test the homogeneity of two random samples of term vectors when making text categorization decisions. First, the properties of the \chi^2-test for text categorization are studied. One advantage of the \chi^2-test is that its significance level \alpha equals the miss rate, that is, the probability of rejecting a document that actually belongs to the category, which provides a foundation for a theoretical performance guarantee. The \chi^2-classifier also incorporates term aggregation and selection methods to improve categorization performance. Cosine similarity with the TF*IDF weighting function generally performs reasonably well in text categorization. However, its performance depends on the chosen threshold value and may fluctuate even near the optimal threshold. To alleviate these problems, the \chi^2-classifier combines the \chi^2-test with cosine similarity. Extensive experimental results verify the properties of the \chi^2-test and the performance of the combined classifier.
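To make the idea of a homogeneity test on term vectors concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the standard Pearson \chi^2 statistic for a 2 x k table of raw term counts (one row per sample) and the cosine similarity of the same two vectors. The function names, the use of raw counts instead of TF*IDF weights, and the final decision rule (accept only if the \chi^2-test does not reject homogeneity and the cosine similarity reaches a threshold) are all assumptions made for illustration; the abstract does not specify how the two measures are combined. The critical value is passed in by the caller, for example 3.841 for 1 degree of freedom or 5.991 for 2 degrees of freedom at \alpha = 0.05.

```python
import math


def chi2_homogeneity(a, b):
    """Return (statistic, degrees_of_freedom) for the Pearson chi-square
    test of homogeneity on two term-count vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("term vectors must have the same length")
    # Drop terms absent from both samples: their expected count is zero.
    cols = [(x, y) for x, y in zip(a, b) if x + y > 0]
    total_a = sum(x for x, _ in cols)
    total_b = sum(y for _, y in cols)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one term occurrence")
    total = total_a + total_b
    stat = 0.0
    for x, y in cols:
        col = x + y
        expected_a = total_a * col / total  # expected count under homogeneity
        expected_b = total_b * col / total
        stat += (x - expected_a) ** 2 / expected_a
        stat += (y - expected_b) ** 2 / expected_b
    return stat, len(cols) - 1


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def same_category(doc_counts, cat_counts, critical_value, cos_threshold):
    """Hypothetical combined rule (an assumption, not the paper's method):
    accept when the chi-square test does not reject homogeneity AND the
    cosine similarity is at least cos_threshold."""
    stat, _ = chi2_homogeneity(doc_counts, cat_counts)
    not_rejected = stat <= critical_value
    return not_rejected and cosine_similarity(doc_counts, cat_counts) >= cos_threshold
```

For instance, `same_category([10, 20, 30], [20, 40, 60], 5.991, 0.5)` accepts, because the two count vectors have identical term proportions, whereas `same_category([30, 10], [10, 30], 3.841, 0.5)` rejects, because the term distributions differ sharply. A production version would look up the critical value from \alpha and the degrees of freedom rather than hard-coding it.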