Lexical analysis and stoplists
Information retrieval
Representation and learning in information retrieval
Representation and learning in information retrieval
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
On relevance weights with little relevance information
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Improving text categorization methods for event tracking
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
A study of thresholding strategies for text categorization
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
A formal study of information retrieval heuristics
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Discriminative models for information retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Organizing structured web sources by query schemas: a clustering approach
Proceedings of the thirteenth ACM international conference on Information and knowledge management
A Study of \chi^2-test for Text Categorization
WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
A multi-classifier system for text categorization
Proceedings of the 2011 ACM Symposium on Research in Applied Computation
A novel probabilistic feature selection method for text classification
Knowledge-Based Systems
The impact of preprocessing on text classification
Information Processing and Management: an International Journal
Hi-index | 12.05 |
In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. The properties of chi-square tests for text categorization are studied first. One of the advantages of chi-square test is that its significance level is similar to the miss rate that provides a foundation for theoretical performance (i.e. miss rate) guarantee. Generally a classifier using cosine similarities with TF*IDF performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To improve the limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experiment results verify properties of chi-square tests and performance of the combined usage.