A vector space model for automatic indexing
Communications of the ACM
A probabilistic model of information retrieval: development and comparative experiments
Information Processing and Management: an International Journal
Cluster-based retrieval using language models
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval
Introduction to Information Retrieval
Hi-index | 0.00 |
Highly heterogeneous collections present difficulties to term weighting models that are informed by corpus-level frequencies. Collections which span multiple languages or large time periods do not provide realistic statistics on which words are interesting to a system. This paper presents a case where diverse corpora can frustrate term weighting and proposes a modification that weighs documents according to their class or cluster within the collection. In cases of diverse corpora, the proposed modification better represents the intuitions behind corpus-level document frequencies.