Using chi-square statistics to measure similarities for text categorization

  • Authors:
  • Yao-Tsung Chen; Meng Chang Chen

  • Affiliations:
  • Department of Computer Science and Information Engineering, National Penghu University of Science and Technology, No. 300, Liu-Ho Rd., Makung City, Penghu County 880, Taiwan; Institute of Information Science, Academia Sinica, Taipei, Taiwan

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Abstract

In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. A classifier using cosine similarities with TF*IDF generally performs reasonably well in text categorization; however, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
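
The abstract does not spell out the formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the standard Pearson chi-square statistic for homogeneity on a 2 x K table of raw term counts, and plain cosine similarity on the same vectors (TF*IDF weighting is omitted). The function names, the `is_same_category` decision rule, and both thresholds are hypothetical and chosen for illustration.

```python
import math


def chi_square_statistic(a, b):
    """Pearson chi-square statistic for homogeneity of two term-count vectors.

    Assumption: a and b hold raw term counts over the same vocabulary.
    """
    if len(a) != len(b):
        raise ValueError("vectors must cover the same vocabulary")
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one term occurrence")
    grand = total_a + total_b
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # a term absent from both samples contributes nothing
        exp_a = total_a * col / grand  # expected count under homogeneity
        exp_b = total_b * col / grand
        stat += (x - exp_a) ** 2 / exp_a + (y - exp_b) ** 2 / exp_b
    return stat


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_category(a, b, chi_critical, cos_threshold):
    """Hypothetical combined rule: require both measures to agree.

    chi_critical is the chi-square critical value for the chosen
    significance level and K - 1 degrees of freedom (supplied by the caller).
    """
    homogeneous = chi_square_statistic(a, b) <= chi_critical
    similar = cosine_similarity(a, b) >= cos_threshold
    return homogeneous and similar


# Example: three terms give 2 degrees of freedom; 5.991 is the
# chi-square critical value at the 0.05 significance level for df = 2.
doc = [10, 20, 30]
category_profile = [12, 18, 30]
decision = is_same_category(doc, category_profile,
                            chi_critical=5.991, cos_threshold=0.9)
```

Requiring both conditions is just one way to combine the two measures; the paper may combine them differently. The point of the sketch is that the chi-square critical value comes from a chosen significance level, whereas the cosine threshold has to be tuned empirically.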