Using chi-square statistics to measure similarities for text categorization

Authors:
Yao-Tsung Chen; Meng Chang Chen
Affiliations:
Department of Computer Science and Information Engineering, National Penghu University of Science and Technology, No. 300, Liu-Ho Rd., Makung City, Penghu County 880, Taiwan; Institute of Information Science, Academia Sinica, Taipei, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. A classifier using cosine similarities with TF*IDF weighting generally performs reasonably well in text categorization; however, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose combining chi-square statistics with cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined approach.
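
As a rough illustration of the two measures named in the abstract, the sketch below computes a Pearson chi-square homogeneity statistic and a cosine similarity for a pair of term vectors. It is not the authors' implementation: the function names are hypothetical, the vectors are assumed to hold raw term counts over a shared vocabulary (no TF*IDF weighting is applied), and the way the paper combines the two measures is not described in the abstract, so no combination rule is shown.

```python
import math


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x K table of term counts.

    Assumption: both inputs are raw term counts over the same vocabulary.
    Returns (statistic, degrees_of_freedom).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("term vectors must have the same length")
    # Drop terms that occur in neither sample; their expected counts are zero.
    cols = [(x, y) for x, y in zip(counts_a, counts_b) if x + y > 0]
    total_a = sum(x for x, _ in cols)
    total_b = sum(y for _, y in cols)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one term occurrence")
    total = total_a + total_b
    stat = 0.0
    for x, y in cols:
        col = x + y
        expected_a = total_a * col / total  # expected count under homogeneity
        expected_b = total_b * col / total
        stat += (x - expected_a) ** 2 / expected_a
        stat += (y - expected_b) ** 2 / expected_b
    return stat, len(cols) - 1


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A small statistic relative to the chi-square critical value at the chosen significance level (with the returned degrees of freedom) means the hypothesis that both samples share one term distribution is not rejected. The critical value is left to a statistics table or library, since the standard library does not provide one.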