Method combination for document filtering
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Web classification using support vector machine
Proceedings of the 4th international workshop on Web information and data management
Guest Editors' Introduction to the Special Issue on Automated Text Categorization
Journal of Intelligent Information Systems
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
Query type classification for web document retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Segmented document classification: problem and solution
DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Hi-index | 0.01 |
Titled Documents (TD) are short text documents that are segmented into two parts: Heading Part and Excerpt Part. With the development of the Internet, TDs are widely used as papers, news, messages, etc. In this paper we discuss the problem of automatic TDs categorization. Unlike traditional text documents, TDs have short headings which have less useless words comparing to their excerpts. Though headings are usually short, their words are more important than other words. Based on this observation we propose a titled document classification framework using the widely used MNB classifier. This framework puts higher weight on the heading words at the cost of some excerpt words. By this means heading words play more important roles in classification than the traditional method. According to our experiments on four datasets that cover three types of documents, the performance of the classifier is improved by our approach.