Features for unsupervised document classification

  • Authors: S. H. Srinivasan
  • Affiliations: Satyam Computer Services Ltd., Bangalore, India
  • Venue: COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
  • Year: 2002

Abstract

Unsupervised document classification is an important problem in practical text mining because training data is seldom available. In this paper we study the problem of term selection and the performance of various features for unsupervised text classification. The features studied are principal components, independent components, and non-negative components. The clustering algorithm is based on bipartite graph partitioning (Zha et al., 2001). The evaluation is performed on the newsgroups corpus.
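
To make one of the feature types concrete, the following is a minimal sketch of how non-negative components might be extracted from a term-document matrix. The abstract does not say which factorization algorithm the paper uses, so the multiplicative update rule here is an assumption, and the toy matrix `V`, the function names, and the parameters (`rank`, `iterations`, `seed`, `eps`) are all hypothetical choices made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (terms x documents) as V ~ W H.

    Uses multiplicative updates for the squared-error objective
    (an assumed algorithm, not necessarily the one used in the paper).
    W is terms x rank, H is rank x documents; both stay non-negative.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_terms)]
    h = [[rng.random() + eps for _ in range(n_docs)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_terms)]
    return w, h


# Hypothetical toy term-document counts: rows are terms, columns are documents.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]

w, h = nmf(V, rank=2)
# Each column of h is a 2-dimensional non-negative feature vector for one document.
doc_features = transpose(h)
```

In this sketch the columns of `h` would serve as the document features passed to a clustering step; principal or independent components could be substituted at the same point in the pipeline.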