A comparative study on the use of labeled and unlabeled data for large margin classifiers

  • Authors:
  • Hiroya Takamura;Manabu Okumura

  • Affiliations:
  • Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, Japan;Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, Japan

  • Venue:
  • IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

We propose to use both labeled and unlabeled data with the Expectation-Maximization (EM) algorithm in order to estimate the generative model and use this model to construct a Fisher kernel. The Naive Bayes generative probability is used to model a document. Through the experiments of text categorization, we empirically show that, (a) the Fisher kernel with labeled and unlabeled data outperforms Naive Bayes classifiers with EM and other methods for a sufficient amount of labeled data, (b) the value of additional unlabeled data diminishes when the labeled data size is large enough for estimating a reliable model, (c) the use of categories as latent variables is effective, and (d) larger unlabeled training datasets yield better results.