Virtual relevant documents in text categorization with support vector machines

  • Authors:
  • Kyung-Soon Lee;Kyo Kageura

  • Affiliations:
  • Division of Electronics, and Information Engineering, Chonbuk National University, 664-14 Duckjin-gu Jeonju, Jeonbuk 561-756, Republic of Korea;Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper explores the incorporation of prior knowledge into support vector machines as a means of compensating for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents, i.e., making virtual documents by combining relevant document pairs for a topic in the training set. The virtual document thus created not only is expected to preserve the topic, but even improve the topical representation by exploiting relevant terms that are not given high importance in individual real documents. Artificially generated documents result in the change in the distribution of training data without the randomization. Experiments with support vector machines based on linear, polynomial and radial-basis function kernels showed the effectiveness on Reuters-21578 set for the topics with a small number of relevant documents. The proposed method achieved 131%, 34%, 12% improvements in micro-averaged F"1 for 25, 46, and 58 topics with less than 10, 30, and 50 relevant documents in learning, respectively. The result analysis indicates that incorporating virtual documents contributes to a steady improvement on the performance.