Using LSI for text classification in the presence of background text

  • Authors:
  • Sarah Zelikovitz;Haym Hirsh

  • Affiliations:
  • Rutgers University, Piscataway, NJ;Rutgers University, Piscataway, NJ

  • Venue:
  • Proceedings of the tenth international conference on Information and knowledge management
  • Year:
  • 2001

Quantified Score

Hi-index 0.01

Visualization

Abstract

This paper presents work that uses Latent Semantic Indexing (LSI) for text classification. However, in addition to relying on labeled training data, we improve classification accuracy by also using unlabeled data and other forms of available "background" text in the classification process. Rather than performing LSI's singular value decomposition (SVD) process solely on the training data, we instead use an expanded term-by-document matrix that includes both the labeled data as well as any available and relevant background text. We report the performance of this approach on data sets both with and without the inclusion of the background text, and compare our work to other efforts that can incorporate unlabeled data and other background text in the classification process.