Virtual relevant documents in text categorization with support vector machines

Authors:
Kyung-Soon Lee;Kyo Kageura
Affiliations:
Division of Electronics and Information Engineering, Chonbuk National University, 664-14 Duckjin-gu Jeonju, Jeonbuk 561-756, Republic of Korea;Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
Venue:
Information Processing and Management: an International Journal
Year:
2007

Abstract

This paper explores the incorporation of prior knowledge into support vector machines as a means of compensating for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents: it creates virtual documents by combining pairs of documents that are relevant to the same topic in the training set. A virtual document created in this way is expected not only to preserve the topic but also to improve the topical representation by exploiting relevant terms that are not given high importance in the individual real documents. The artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines based on linear, polynomial, and radial-basis function kernels showed the effectiveness of the method on the Reuters-21578 set for topics with a small number of relevant documents. The proposed method achieved improvements of 131%, 34%, and 12% in micro-averaged F1 for the 25, 46, and 58 topics with fewer than 10, 30, and 50 relevant documents in learning, respectively. Analysis of the results indicates that incorporating virtual documents contributes to a steady improvement in performance.
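
The pairing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how two documents are combined, so this example assumes that combining a pair means adding their term-frequency vectors (equivalent to concatenating the two texts). The function name make_virtual_documents and the toy token lists are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def make_virtual_documents(relevant_docs):
    """Build one virtual document per unordered pair of relevant documents.

    Each input document is a list of tokens for the same topic.
    Assumption (not stated in the abstract): a pair is combined by
    adding the two term-frequency vectors.
    """
    term_counts = [Counter(tokens) for tokens in relevant_docs]
    # combinations() yields each unordered pair exactly once
    return [a + b for a, b in combinations(term_counts, 2)]


# Toy documents, all assumed relevant to a single topic
relevant = [
    ["wheat", "export", "tonnes"],
    ["wheat", "harvest", "grain"],
    ["grain", "export", "price"],
]
virtual = make_virtual_documents(relevant)
```

Under this assumption, n relevant documents yield n(n-1)/2 virtual documents, which would be added to the positive training examples for that topic. Terms shared by both members of a pair receive a higher count in the virtual document than in either real document.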