Smoothing is applied in the naive Bayes classifier because the maximum likelihood (ML) estimate assigns zero probability to features that are absent from the training data. Smoothing, however, lacks the firm theoretical foundation that the ML estimate enjoys. In this paper, we propose two novel strategies, NB_TF and NB_TS, that remove smoothing from the classifier without sacrificing classification accuracy. NB_TF adjusts the classifier by adding the test document to the model before classification, which makes it suitable for online categorization. NB_TS adds the whole test set to the classifier during the training stage, which makes it more efficient for batch categorization. Experiments and analysis show that NB_TS outperforms both Laplace additive smoothing and Simple Good-Turing (SGT) smoothing, and that NB_TF performs better than Laplace additive smoothing.
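To make the NB_TF strategy concrete, here is a minimal Python sketch for a multinomial naive Bayes model, following the abstract's description: the test document's term counts are folded into each class before estimating P(w|c), so every term occurring in the test document has a nonzero count and the plain ML estimate needs no smoothing. The function name, array layout, and the multinomial form of the classifier are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def nb_tf_log_scores(class_term_counts, class_doc_counts, test_tf):
    """Sketch of NB_TF: add the test document to each class before
    estimating P(w|c), so the plain ML estimate has no zero counts
    for any term that appears in the test document.

    class_term_counts: (C, V) term counts per class from training.
    class_doc_counts:  (C,) number of training documents per class (prior).
    test_tf:           (V,) term-frequency vector of the test document.
    Returns a (C,) array of unnormalized log posterior scores.
    """
    scores = np.empty(class_term_counts.shape[0])
    log_priors = np.log(class_doc_counts / class_doc_counts.sum())
    present = test_tf > 0                        # only terms in the test document matter
    for c in range(class_term_counts.shape[0]):
        counts = class_term_counts[c] + test_tf  # NB_TF: fold the test document in
        theta = counts / counts.sum()            # ML estimate, zero-free where present
        scores[c] = log_priors[c] + test_tf[present] @ np.log(theta[present])
    return scores

# NB_TS variant (batch setting): instead of folding each test document in at
# classification time, add the pooled term counts of the entire test set to
# class_term_counts once, during training, then classify with the ML estimate.

# Example: two classes over a three-word vocabulary. Term 1 is unseen in
# class 0, which is exactly where the unadjusted ML estimate would yield
# log(0); NB_TF avoids this without any smoothing.
counts = np.array([[3.0, 0.0, 1.0], [1.0, 2.0, 2.0]])
docs = np.array([2.0, 3.0])
doc = np.array([0.0, 2.0, 1.0])
print(nb_tf_log_scores(counts, docs, doc).argmax())
```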