A Framework for Titled Document Categorization with Modified Multinomial Naivebayes Classifier

Authors:
Hang Guo; Lizhu Zhou
Affiliations:
Computer Science & Technology Department, Tsinghua University, Beijing 100084, China; Computer Science & Technology Department, Tsinghua University, Beijing 100084, China
Venue:
ADMA '07 Proceedings of the 3rd international conference on Advanced Data Mining and Applications
Year:
2007

Abstract

Titled Documents (TDs) are short text documents segmented into two parts: a Heading Part and an Excerpt Part. With the development of the Internet, TDs are widely used for papers, news, messages, and similar content. In this paper we discuss the problem of automatic TD categorization. Unlike traditional text documents, TDs have short headings that contain fewer uninformative words than their excerpts. Although headings are usually short, their words are more important than other words. Based on this observation, we propose a titled document classification framework built on the widely used Multinomial Naive Bayes (MNB) classifier. The framework places higher weight on heading words at the cost of some excerpt words, so heading words play a more important role in classification than they do in the traditional method. According to our experiments on four datasets covering three types of documents, our approach improves the performance of the classifier.
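
The idea of giving heading words more weight in an MNB classifier can be illustrated with a minimal sketch. The abstract does not state the exact weighting scheme, so everything here is an assumption made for illustration: the names `weighted_counts` and `WeightedMNB`, the multiplier `heading_weight`, the Laplace smoothing parameter `alpha`, and the whitespace tokenization are all hypothetical. The sketch simply counts each heading token `heading_weight` times before applying standard multinomial Naive Bayes; it does not model how the paper trades off excerpt words.

```python
import math
from collections import Counter, defaultdict


def weighted_counts(heading, excerpt, heading_weight=3.0):
    """Bag of words in which each heading token counts heading_weight times.

    Assumption: a single scalar multiplier stands in for the paper's
    (unspecified) heading weighting.
    """
    counts = Counter()
    for token in heading.lower().split():
        counts[token] += heading_weight
    for token in excerpt.lower().split():
        counts[token] += 1.0
    return counts


class WeightedMNB:
    """Multinomial Naive Bayes over heading-weighted term counts (sketch)."""

    def __init__(self, heading_weight=3.0, alpha=1.0):
        self.heading_weight = heading_weight
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, docs, labels):
        # docs is a list of (heading, excerpt) pairs.
        self.n_docs = len(labels)
        self.class_docs = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for (heading, excerpt), label in zip(docs, labels):
            counts = weighted_counts(heading, excerpt, self.heading_weight)
            self.term_counts[label].update(counts)
            self.vocab.update(counts)
        return self

    def predict(self, heading, excerpt):
        counts = weighted_counts(heading, excerpt, self.heading_weight)
        best_label, best_score = None, -math.inf
        for label, n_class in self.class_docs.items():
            total = sum(self.term_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(n_class / self.n_docs)  # log prior
            for token, count in counts.items():
                if token not in self.vocab:
                    continue  # ignore terms never seen in training
                p = (self.term_counts[label][token] + self.alpha) / denom
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
train_docs = [
    ("stock market rally", "shares rose today as investors cheered the game plan"),
    ("football cup final", "the team won the match after the market closed"),
    ("bank profit report", "the bank said earnings rose"),
    ("tennis match result", "the player won the final set"),
]
train_labels = ["finance", "sports", "finance", "sports"]

model = WeightedMNB(heading_weight=3.0).fit(train_docs, train_labels)
prediction = model.predict("stock market rally", "shares rose")
```

Setting `heading_weight=1.0` reduces the sketch to ordinary MNB over the concatenated heading and excerpt, which gives a simple baseline for comparing the effect of the weighting.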