For the purposes of classification it is common to represent a document as a bag of words: the individual terms making up the document, together with the number of times each term appears in it. All classification methods make use of the terms themselves; many also make use of the local term frequencies, at the price of some added complexity in the model. Examples include the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and its exponential-family approximation (EDCM), as well as support vector machines (SVMs). Although it is often claimed that incorporating local word frequency improves text classification performance, we test that claim here. We show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as the MM, DCM, EDCM, and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) that is able to make use of local term frequency, and show that it, too, offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word add essentially no useful information to a classifier.
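The two representations under comparison can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the whitespace tokenizer are our own simplifying assumptions, standing in for whatever preprocessing the experiments actually used.

```python
from collections import Counter


def bag_of_words(text):
    # Frequency-aware representation: map each term to its count
    # in the document (the input to MM, DCM, EDCM, and SVM models).
    return Counter(text.lower().split())


def binary_bag(text):
    # Frequency-ignoring variant: map each term to 1 if it occurs
    # at all, discarding repeat occurrences (the simplified models).
    return {term: 1 for term in text.lower().split()}


doc = "the cat sat on the mat near the cat"
print(bag_of_words(doc))  # e.g. 'the' -> 3, 'cat' -> 2
print(binary_bag(doc))    # every present term -> 1
```

The paper's finding is that classifiers built on the second, presence-only representation perform at about the same level as those built on the first, because bursty repeat occurrences carry little extra signal.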