Addressing diverse corpora with cluster-based term weighting

Authors: Peter Organisciak
Affiliations: University of Illinois at Urbana-Champaign, Champaign, IL, USA
Venue: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Year: 2013

Abstract

Highly heterogeneous collections pose difficulties for term weighting models that rely on corpus-level frequencies. Collections that span multiple languages or long time periods do not provide realistic statistics about which words are interesting to a system. This paper presents a case where a diverse corpus frustrates term weighting and proposes a modification that weights documents according to their class or cluster within the collection. For diverse corpora, the proposed modification better represents the intuitions behind corpus-level document frequencies.
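
To make the idea concrete, here is a minimal sketch of one possible reading of cluster-based weighting: inverse document frequency is computed within each document's cluster instead of over the whole collection. This is an illustration under stated assumptions, not the paper's exact formulation. The function names (`idf`, `cluster_idf`), the toy documents, and the use of language labels as clusters are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def idf(docs):
    """Plain IDF, log(N / df), over a list of tokenized documents."""
    n = len(docs)
    # Count each term once per document to get document frequencies.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def cluster_idf(docs, labels):
    """IDF computed separately inside each cluster (assumed labels)."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return {label: idf(members) for label, members in groups.items()}


# Toy mixed-language collection (hypothetical data).
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "bird"],
    ["der", "hund"],
    ["der", "vogel"],
]
labels = ["en", "en", "en", "de", "de"]

corpus_weights = idf(docs)
cluster_weights = cluster_idf(docs, labels)

# Corpus-wide, "der" occurs in 2 of 5 documents and looks informative.
print(corpus_weights["der"])
# Within the German cluster it occurs in every document, so its weight is zero.
print(cluster_weights["de"]["der"])
```

In this toy collection the German article "der" receives a positive corpus-level weight only because German documents are a minority, while the within-cluster weight treats it as the uninformative function word it is.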