Improving text categorization using domain knowledge

Authors:
Jingbo Zhu; Wenliang Chen
Affiliations:
Natural Language Processing Lab, Institute of Computer Software and Theory, Northeastern University, Shenyang, P.R. China (both authors)
Venue:
NLDB'05: Proceedings of the 10th International Conference on Natural Language Processing and Information Systems
Year:
2005

Abstract

In this paper, we propose an approach to improving document classification using domain knowledge. We first introduce a domain knowledge dictionary, NEUKD, and propose two models that use domain knowledge as textual features for text categorization. The first is the BOTW model, which uses domain-associated terms and conventional words as textual features. The second is the BOF model, which uses domain features as textual features. Because the size of the domain knowledge dictionary is limited, we use a machine learning technique to address this limitation and propose the BOL model, which can be regarded as an extension of the BOF model. In the comparison experiments, a naïve Bayes system based on the BOW (bag-of-words) model serves as the baseline. Experimental results for naïve Bayes systems based on the four models (BOW, BOTW, BOF and BOL) show that domain knowledge is very useful for improving text categorization. The BOTW model outperforms the BOW model, and the BOL and BOF models outperform the BOW model when the number of features is small. By learning new features with the machine learning technique, the BOL model outperforms the BOF model.
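The difference between the BOW, BOTW and BOF feature sets can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation. The dictionary entries, the helper names (`match_terms`, `bow`, `botw`, `bof`), the greedy longest-match lookup and the small `NaiveBayes` class with add-one smoothing are all assumptions made for illustration; the entries are not taken from NEUKD, and the BOL model is not covered because the abstract does not say how new features are learned.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a domain knowledge dictionary (NOT real NEUKD
# entries): maps a domain-associated term to a domain feature.
TOY_DICTIONARY = {
    "free kick": "football",
    "goalkeeper": "football",
    "interest rate": "finance",
    "stock market": "finance",
}

def match_terms(text, dictionary):
    """Split text into dictionary terms (greedy, longest first) and other words."""
    words = text.lower().split()
    max_len = max(len(term.split()) for term in dictionary)
    terms, rest, i = [], [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in dictionary:
                terms.append(candidate)
                i += n
                break
        else:  # no dictionary term starts at position i
            rest.append(words[i])
            i += 1
    return terms, rest

def bow(text):
    """BOW: every word is a feature."""
    return text.lower().split()

def botw(text, dictionary):
    """BOTW (assumed reading): domain-associated terms plus remaining words."""
    terms, rest = match_terms(text, dictionary)
    return terms + rest

def bof(text, dictionary):
    """BOF (assumed reading): only the domain features of matched terms."""
    terms, _ = match_terms(text, dictionary)
    return [dictionary[t] for t in terms]

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over feature lists."""

    def fit(self, feature_lists, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for feats, label in zip(feature_lists, labels):
            self.feature_counts[label].update(feats)
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training data, invented for this sketch.
train = [
    ("the goalkeeper saved a free kick", "sports"),
    ("the bank raised its interest rate", "economy"),
]
model = NaiveBayes().fit(
    [bof(text, TOY_DICTIONARY) for text, _ in train],
    [label for _, label in train],
)
print(model.predict(bof("the stock market fell sharply", TOY_DICTIONARY)))
```

In this toy setup the term "stock market" never occurs in the training documents, yet it maps to the same domain feature as "interest rate", so the BOF representation still carries evidence for the right class. This is only an intuition for why a small feature set built from domain features might help; it is not a reproduction of the reported experiments.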