Lower-bounding term frequency normalization

Authors:
Yuanhua Lv; ChengXiang Zhai
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

In this paper, we reveal a common deficiency of current retrieval models: the component that normalizes term frequency (TF) by document length is not properly lower-bounded, so very long documents tend to be overly penalized. To diagnose this problem analytically, we propose two formal constraints that capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. The analysis shows that each of these functions satisfies the constraints only for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that retrieval performance tends to be poor when the parameter falls outside that range or the query term is not in that set. To solve this common problem, we propose a general and efficient method that introduces a sufficiently large lower bound for TF normalization, which can be shown analytically to fix or alleviate the problem. Our experiments demonstrate that the proposed method incurs almost no additional computational cost and can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence-from-randomness approach, to significantly improve average precision, especially for verbose queries.
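The abstract does not give formulas, so the following is only an illustrative sketch of the lower-bounding idea applied to the standard Okapi BM25 TF component. The additive constant `delta` (in the form often referred to as BM25+), the default parameter values, and the function names are assumptions made for this example and are not taken from the text above.

```python
def bm25_tf(tf, dl, avdl, k1=1.2, b=0.75):
    """Standard BM25 TF component with document-length normalization.

    tf: term count in the document, dl: document length,
    avdl: average document length in the collection.
    """
    if tf <= 0:
        return 0.0
    norm = k1 * (1.0 - b + b * dl / avdl)
    return (k1 + 1.0) * tf / (norm + tf)


def lower_bounded_tf(tf, dl, avdl, delta=1.0, k1=1.2, b=0.75):
    """BM25 TF component plus an assumed additive lower bound `delta`.

    A document containing the term always scores at least `delta` above
    a document that does not, no matter how long it is.
    """
    if tf <= 0:
        return 0.0
    return bm25_tf(tf, dl, avdl, k1, b) + delta


if __name__ == "__main__":
    avdl = 100.0
    for dl in (100, 10_000, 1_000_000):
        plain = bm25_tf(1, dl, avdl)
        bounded = lower_bounded_tf(1, dl, avdl)
        print(f"dl={dl:>9}: bm25_tf={plain:.6f}  lower_bounded_tf={bounded:.6f}")
```

As `dl` grows, the plain component approaches zero, so a very long document that matches the term becomes almost indistinguishable from one that does not; the lower-bounded variant keeps a gap of at least `delta`. In a full scoring function each component would still be multiplied by an IDF-style term weight and summed over the query terms.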