On the use of linear programming for unsupervised text classification

  • Authors:
  • Mark Sandler

  • Affiliations:
  • Cornell University, Ithaca, NY

  • Venue:
  • Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We use mixture models as underlying process of generating corpus and utilize a novel, L1-norm based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines (SVMs) with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI). We find a good low-dimensional subspace of a feature space and project all documents into it. However our projection minimizes different error, and unlike LSI we build a basis, that in many cases corresponds to the actual topics. We present results of testing of our algorithm on the abstracts of arXiv - an electronic repository of scientific papers, and the 20 Newsgroup dataset - a small snapshot of 20 specific newsgroups.