Scaling up semi-supervised learning: an efficient and effective LLGC variant

  • Authors:
  • Bernhard Pfahringer; Claire Leschi; Peter Reutemann

  • Affiliations:
  • Department of Computer Science, University of Waikato, Hamilton, New Zealand; INSA Lyon, France; Department of Computer Science, University of Waikato, Hamilton, New Zealand

  • Venue:
  • PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year:
  • 2007

Abstract

Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC (learning with local and global consistency) algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
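
To make the sparsification idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives no details, so the k-nearest-neighbour rule, the Gaussian similarity, and the names `knn_sparsify` and `llgc` are all assumptions made for illustration. The propagation step is the standard LLGC iteration F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2). The priming step is not shown.

```python
import math


def knn_sparsify(X, k):
    """Build a sparse, symmetric similarity matrix as a list of dicts.

    Assumption: each example keeps only its k most similar neighbours,
    scored with a Gaussian kernel exp(-||x_i - x_j||^2). The neighbour
    search here is brute force (quadratic); a large-scale version would
    need an index or another shortcut.
    """
    n = len(X)
    W = [dict() for _ in range(n)]
    for i in range(n):
        sims = []
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                sims.append((math.exp(-d2), j))
        sims.sort(reverse=True)
        for s, j in sims[:k]:
            W[i][j] = s
            W[j][i] = s  # keep the matrix symmetric
    return W


def llgc(W, Y, alpha=0.9, iters=50):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on a sparse W.

    Y is an n x c list of lists: one-hot rows for labeled examples,
    all-zero rows for unlabeled ones. Returns the predicted class
    index for every example.
    """
    n, c = len(W), len(Y[0])
    deg = [sum(row.values()) for row in W]
    # Symmetric normalisation: S = D^(-1/2) W D^(-1/2)
    S = [
        {j: w / math.sqrt(deg[i] * deg[j]) for j, w in row.items()}
        for i, row in enumerate(W)
    ]
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [
            [
                alpha * sum(s * F[j][m] for j, s in S[i].items())
                + (1 - alpha) * Y[i][m]
                for m in range(c)
            ]
            for i in range(n)
        ]
    return [max(range(c), key=lambda m: F[i][m]) for i in range(n)]


# Toy data: two well-separated clusters, one labeled example in each.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
Y = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
W = knn_sparsify(X, k=2)
labels = llgc(W, Y)
```

With a sparse matrix, each iteration costs time proportional to the number of stored similarities rather than to the square of the number of examples, which is the kind of saving the abstract points to.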