Discriminative Features for Document Classification

  • Authors:
  • Kari Torkkola

  • Affiliations:
  • -

  • Venue:
  • ICPR '02 Proceedings of the 16 th International Conference on Pattern Recognition (ICPR'02) Volume 1 - Volume 1
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often used approach is to select a small number of most important features out of the whole set according to some relevant criterion. This paper points out that LSI ignores discrimination while concentrating on representation. Furthermore, selection methods fail to produce a feature set that jointly optimizes class discrimination. As a remedy, we suggest supervised linear discriminative transforms, and report good classification results applying these to the Reuters-21578 database.