Addressing diverse corpora with cluster-based term weighting

  • Authors: Peter Organisciak

  • Affiliations: University of Illinois at Urbana-Champaign, Champaign, IL, USA

  • Venue: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
  • Year: 2013

Abstract

Highly heterogeneous collections pose difficulties for term weighting models that rely on corpus-level frequencies. Collections that span multiple languages or long time periods do not yield realistic statistics about which words are interesting to a system. This paper presents a case in which a diverse corpus frustrates term weighting and proposes a modification that weights documents according to their class or cluster within the collection. For diverse corpora, the proposed modification better reflects the intuitions behind corpus-level document frequencies.
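
The problem the abstract describes can be illustrated with a small sketch. The abstract does not give the paper's formula, so the code below is not the author's method: it assumes plain inverse document frequency, log(N / df), and simply recomputes it inside each cluster as one possible cluster-aware variant. The function names `idf` and `cluster_idf`, the toy documents, and the language labels used as clusters are all hypothetical.

```python
import math
from collections import defaultdict


def idf(docs):
    """Plain inverse document frequency, log(N / df), over tokenized documents."""
    n = len(docs)
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def cluster_idf(docs, labels):
    """Hypothetical variant: compute IDF separately within each cluster."""
    by_cluster = defaultdict(list)
    for tokens, label in zip(docs, labels):
        by_cluster[label].append(tokens)
    return {label: idf(members) for label, members in by_cluster.items()}


# Toy mixed-language collection; the labels stand in for class or cluster ids.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "le chat dort".split(),
    "le chien mange".split(),
]
labels = ["en", "en", "en", "fr", "fr"]

corpus_weights = idf(docs)
cluster_weights = cluster_idf(docs, labels)

for term in ("the", "le", "cat", "chat"):
    print(term, round(corpus_weights[term], 3))
print("le within fr cluster:", cluster_weights["fr"]["le"])
```

In this toy collection, the French article "le" appears in every French document but in only two of five documents overall, so corpus-level IDF gives it the same weight as the content word "cat". Computed within the French cluster, its weight drops to zero, which is closer to the intuition that very common words should count for little.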