Pruning The Vocabulary For Better Context Recognition

  • Authors:
  • Rasmus Elsborg Madsen;Sigurdur Sigurdsson;Lars Kai Hansen;Jan Larsen

  • Affiliations:
  • Technical University of Denmark;Technical University of Denmark;Technical University of Denmark;Technical University of Denmark

  • Venue:
  • ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 2
  • Year:
  • 2004

Abstract

Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, though, and contains many words that are inconsistent for text categorization. These inconsistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication, our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With the reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
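
As a rough illustration of the information-gain side of the pruning idea, the following minimal Python sketch ranks terms by how much their presence or absence reduces uncertainty about the class label, then keeps only the top-ranked fraction. It is not the authors' implementation: the binary presence/absence formulation, the toy corpus, and the names `entropy`, `information_gain`, `prune_vocabulary`, and `keep_fraction` are all assumptions made for this example, and the sensitivity-map criterion is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of a term's presence/absence about the class label.

    Assumes each document is a set of tokens (binary bag-of-words).
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def prune_vocabulary(docs, labels, keep_fraction=0.1):
    """Keep the top `keep_fraction` of terms, ranked by information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(vocab)))
    return ranked[:n_keep]


# Hypothetical toy corpus: four documents, two classes.
docs = [
    {"the", "goal", "match"},
    {"the", "team", "goal"},
    {"the", "vote", "party"},
    {"the", "party", "election"},
]
labels = ["sport", "sport", "politics", "politics"]

kept = prune_vocabulary(docs, labels, keep_fraction=0.3)
print(kept)
```

In this toy setting, a word such as "the" that occurs in every document carries no information about the class and is ranked last, while words that split the classes cleanly are ranked first. A `keep_fraction` between 0.02 and 0.10 would correspond to the 90%-98% vocabulary reduction mentioned in the abstract.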