Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

  • Authors:
  • Jingyang Li;Maosong Sun

  • Affiliations:
  • National Laboratory of Intelligent Technology and Systems, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China;National Laboratory of Intelligent Technology and Systems, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China

  • Venue:
  • CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2009

Abstract

Traditional tfidf-like term weighting schemes use a rough statistic, idf, as the term weighting factor. idf does not exploit the category information (category labels on documents) or the intra-document information (the relative importance of a given term to a given document that contains it) available in the training data for a text categorization task. We present here a more elaborate nonparametric probabilistic model that makes use of this information in the term weighting phase. idf is theoretically proved to be a rough approximation of this new term weighting factor. This work is preliminary and aims mainly at providing inspiration for further study on the exploitation of this information, but it already provides a moderate performance boost on three popular document collections.
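
For orientation, the conventional baseline that the abstract contrasts against can be sketched as follows. This is only an illustration of one common tf-idf variant (raw term frequency multiplied by log(N / df)), not the authors' proposed model, which the abstract does not specify. The function names and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def idf(docs):
    """Inverse document frequency, log(N / df(t)), for every term in the corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc, idf_weights):
    """Weight each term of one document by raw term frequency times idf."""
    tf = Counter(doc)
    return {term: freq * idf_weights.get(term, 0.0) for term, freq in tf.items()}

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["stock", "market", "stock"],
    ["market", "football"],
    ["football", "goal", "goal"],
]
weights = idf(docs)
vector = tfidf(docs[0], weights)
```

Note that `idf` here depends only on how many documents contain a term. It never consults category labels or how prominent the term is within a document, which is the limitation the abstract describes.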