Uncertainty and term selection in text categorization

  • Authors:
  • Charles M. E. E. Peters; Cornelis H. A. Koster

  • Affiliations:
  • Department of Computer Science, University of Nijmegen, Nijmegen, The Netherlands; Department of Computer Science, University of Nijmegen, Nijmegen, The Netherlands

  • Venue:
  • International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
  • Year:
  • 2003

Abstract

This paper discusses the notion of Uncertainty, which has a prominent place in the theory and experimental practice of modern Physics. It argues that an awareness of Uncertainty may also be of tremendous importance to the field of Information Retrieval, and to Text Categorization in particular.

As an application of Uncertainty in Text Categorization, a new criterion for Term Selection is described, which is based on the Uncertainty in Term Frequency across categories. This criterion makes it possible to distinguish between low-quality (or "noisy") and high-quality ("stiff") terms.

We describe an experiment investigating the effect of eliminating noisy and stiff terms in the context of text classification. In the experiment we applied the Rocchio and Winnow classification algorithms to a collection of newspaper items, a mono-classified subset of the well-known Reuters-21578 corpus.

This investigation shows that both the local elimination of noisy terms and the global elimination of stiff terms can be used for Term Selection in Text Categorization.