Incorporating topical support documents into a small training set in text categorization

Authors:
Kyung Soon Lee
Affiliations:
Chonbuk National University, Jeonju, South Korea
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008


Abstract

This paper explores incorporating topical support documents into a training set to compensate for a shortage of positive training data in text categorization. To strengthen topical representation, our method applies a simple transformation: it creates new documents from existing positive documents by squaring a conventional term weight. The resulting topical support documents are expected not only to preserve the topic but also to improve its representation by emphasizing terms with higher weights. Experiments with support vector machines demonstrate the method's effectiveness on the RCV1 collection when only a small number of positive training documents is available. In micro-averaged F1, our topical support representation achieved improvements of 52.01% and 8.83% for 33 and 56 RCV1 Topic categories that have fewer than 100 and 300 positive training documents, respectively. A robustness-based analysis of the results indicates that topical support documents contribute a steady and stable improvement.
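
The transformation described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. The toy weights stand in for whatever conventional term weight is used, the L2 re-normalization after squaring is an assumption, and the function names (`l2_normalize`, `make_support_document`, `augment_training_set`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse term-weight vector (a dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def make_support_document(vec):
    """Square every term weight of a positive document.

    Re-normalizing afterwards is an assumption of this sketch; the abstract
    only states that the conventional term weight is squared.
    """
    return l2_normalize({term: w * w for term, w in vec.items()})


def augment_training_set(vectors, labels):
    """Append one support document per positive example, labeled positive."""
    aug_vectors = list(vectors)
    aug_labels = list(labels)
    for vec, label in zip(vectors, labels):
        if label == 1:
            aug_vectors.append(make_support_document(vec))
            aug_labels.append(1)
    return aug_vectors, aug_labels


# Toy data: hypothetical term weights for one positive and one negative document.
positive = l2_normalize({"oil": 3.0, "price": 2.0, "said": 1.0})
negative = l2_normalize({"match": 2.0, "said": 1.0})

X, y = augment_training_set([positive, negative], [1, -1])
support = X[2]  # the support document derived from `positive`
```

After normalization, squaring shifts weight toward the terms that were already dominant: in the toy example the share of "oil" grows while that of "said" shrinks. This matches the abstract's claim that support documents emphasize higher-weighted terms. The augmented `X` and `y` could then be passed to any SVM trainer.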