CBC: Clustering Based Text Classification Requiring Minimal Labeled Data

  • Authors:
  • Hua-Jun Zeng;Xuan-Hui Wang;Zheng Chen;Hongjun Lu;Wei-Ying Ma

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

Semi-supervised learning methods construct classifiers using both labeled and unlabeled training data samples. While unlabeled data samples can help to improve the accuracy of trained models to a certain extent, existing methods still face difficulties when labeled data is not sufficient and biased against the underlying data distribution. In this paper, we present a clustering based classification (CBC) approach. Using this approach, training data, including both the labeled and unlabeled data, is first clustered with the guidance of the labeled data. Some of the unlabeled data samples are then labeled based on the clusters obtained. Discriminative classifiers can subsequently be trained with the expanded labeled dataset. The effectiveness of the proposed method is justified analytically. Our experimental results demonstrated that CBC outperforms existing algorithms when the size of the labeled dataset is very small.
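
The three stages named in the abstract (label-guided clustering, labeling some unlabeled samples from the clusters, then training a discriminative classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm or the classifier, so the sketch assumes a k-means variant seeded with the labeled points and uses a simple perceptron as a stand-in discriminative classifier. All function names, the `per_class` parameter, and the toy 2-D data are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def nearest_class(x, centroids):
    """Label of the centroid closest to x."""
    return min(centroids, key=lambda c: dist(x, centroids[c]))

def seeded_kmeans(labeled, unlabeled, iters=10):
    """Assumed clustering step: one cluster per class, with centroids
    initialised from the labeled points. Labeled points never change cluster."""
    classes = sorted({y for _, y in labeled})
    centroids = {c: mean([x for x, y in labeled if y == c]) for c in classes}
    for _ in range(iters):
        members = {c: [x for x, y in labeled if y == c] for c in classes}
        for x in unlabeled:
            members[nearest_class(x, centroids)].append(x)
        centroids = {c: mean(members[c]) for c in classes}
    return centroids

def expand_labels(labeled, unlabeled, centroids, per_class=2):
    """Label only the `per_class` unlabeled points closest to each centroid."""
    expanded = list(labeled)
    for c, centre in centroids.items():
        assigned = [x for x in unlabeled if nearest_class(x, centroids) == c]
        assigned.sort(key=lambda x: dist(x, centre))
        expanded += [(x, c) for x in assigned[:per_class]]
    return expanded

def train_perceptron(data, epochs=50):
    """Stand-in discriminative classifier for labels in {-1, +1}."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(model, x):
    """Return +1 or -1 for vector x under the trained perceptron."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: one labeled example per class, six unlabeled examples.
labeled = [([0.0, 0.0], -1), ([4.0, 4.0], 1)]
unlabeled = [[0.5, 0.2], [0.1, 0.8], [1.0, 1.0],
             [3.5, 3.8], [4.2, 3.9], [3.0, 3.0]]

centroids = seeded_kmeans(labeled, unlabeled)
expanded = expand_labels(labeled, unlabeled, centroids, per_class=2)
model = train_perceptron(expanded)
print(len(expanded), predict(model, [0.2, 0.3]), predict(model, [3.8, 3.6]))
```

Labeling only the points nearest each centroid reflects the abstract's statement that "some of" the unlabeled samples are labeled; how many to label, and by what confidence criterion, is a design choice the abstract leaves open.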