Text Classification from Labeled and Unlabeled Documents using EM

  • Authors:
  • Kamal Nigam; Andrew Kachites McCallum; Sebastian Thrun; Tom Mitchell

  • Affiliations:
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. knigam@cs.cmu.edu; Just Research, 4616 Henry Street, Pittsburgh, PA; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. thrun@cs.cmu.edu; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. tom.mitchell@cmu.edu

  • Venue:
  • Machine Learning - Special issue on information retrieval
  • Year:
  • 2000


Abstract

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.

We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
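The following is a minimal sketch of the EM procedure the abstract describes: fit multinomial naive Bayes on the labeled documents, probabilistically label the unlabeled documents (E-step), retrain on the pooled soft labels (M-step), and iterate. The `lam` parameter corresponds to extension (1), the weighting factor on the unlabeled data. All names, the dense-array representation, and default values like `n_iter=10` and `alpha=1.0` are illustrative assumptions, not the paper's reference implementation.

```python
# Sketch of semi-supervised multinomial naive Bayes via EM (assumptions:
# dense bag-of-words count matrices, Laplace smoothing, fixed iteration count).
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iter=10, alpha=1.0, lam=1.0):
    """X_l: (n_labeled, vocab) word counts; y_l: integer labels;
    X_u: (n_unlabeled, vocab) word counts; lam: weight on unlabeled data."""
    # Hard (one-hot) responsibilities for labeled docs stay fixed throughout.
    R_l = np.eye(n_classes)[y_l]

    def m_step(R_u):
        # Pool (possibly fractional) class counts from both document sets;
        # extension (1) down-weights the unlabeled contribution by lam.
        if R_u is None:
            R, X = R_l, X_l
        else:
            R = np.vstack([R_l, lam * R_u])
            X = np.vstack([X_l, X_u])
        class_counts = R.sum(axis=0)                    # expected docs per class
        word_counts = R.T @ X                           # (C, vocab) soft word counts
        log_prior = np.log(class_counts / class_counts.sum())
        # Laplace-smoothed multinomial word probabilities per class.
        log_theta = np.log(
            (word_counts + alpha)
            / (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
        return log_prior, log_theta

    def e_step(log_prior, log_theta, X):
        # Posterior P(class | doc) under the naive Bayes model.
        log_post = X @ log_theta.T + log_prior          # (n_docs, C), unnormalized
        log_post -= log_post.max(axis=1, keepdims=True) # stabilize before exp
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initialize from the labeled documents only, then iterate EM.
    log_prior, log_theta = m_step(None)
    for _ in range(n_iter):
        R_u = e_step(log_prior, log_theta, X_u)         # E: probabilistic labels
        log_prior, log_theta = m_step(R_u)              # M: retrain on all docs
    return log_prior, log_theta
```

A new document would then be classified by running `e_step` on its count vector and taking the argmax over classes; extension (2), multiple mixture components per class, is not shown here and would replace the one-component-per-class responsibilities with a finer-grained mixture.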