When documents are very long, BM25 fails!

Authors:
Yuanhua Lv;ChengXiang Zhai
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

We show that the Okapi BM25 retrieval function tends to over-penalize very long documents. To address this problem, we present a simple yet effective extension of BM25, namely BM25L, which "shifts" the term frequency normalization formula to boost the scores of very long documents. Our experiments show that BM25L, at the same computational cost, is more effective and robust than standard BM25.
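
The abstract does not spell out the formula, so the following is only a minimal sketch of what "shifting" the term frequency normalization could look like. It assumes the usual BM25 length normalization c' = tf / (1 - b + b * dl / avdl) and adds a constant delta to c' before saturation. The function names, the parameter defaults (k1 = 1.2, b = 0.75) and delta = 0.5 are illustrative assumptions, not values taken from this record, and the IDF factor is left out because the shift affects only the TF component.

```python
def normalized_tf(tf, doc_len, avg_doc_len, b=0.75):
    """Length-normalized term frequency: c' = tf / (1 - b + b * dl / avdl)."""
    return tf / (1.0 - b + b * doc_len / avg_doc_len)


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """TF component of standard BM25 (IDF factor omitted)."""
    if tf <= 0:
        return 0.0
    c = normalized_tf(tf, doc_len, avg_doc_len, b)
    return (k1 + 1.0) * c / (k1 + c)


def bm25l_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75, delta=0.5):
    """TF component with the normalized TF shifted by delta (assumed form).

    For a matching term (tf > 0) the result never drops below
    (k1 + 1) * delta / (k1 + delta), however long the document is.
    """
    if tf <= 0:
        return 0.0
    c = normalized_tf(tf, doc_len, avg_doc_len, b) + delta
    return (k1 + 1.0) * c / (k1 + c)


if __name__ == "__main__":
    avg_len = 500.0
    # One occurrence of the query term in documents of increasing length.
    for doc_len in (500.0, 5_000.0, 50_000.0):
        plain = bm25_tf(1, doc_len, avg_len)
        shifted = bm25l_tf(1, doc_len, avg_len)
        print(f"dl={doc_len:>8.0f}  BM25={plain:.4f}  shifted={shifted:.4f}")
```

Under these assumptions, the unshifted component approaches zero as the document grows, so a very long document that does contain the term scores almost the same as one that does not. The shifted variant keeps a positive floor for any matching term, which is the kind of boost for very long documents the abstract describes, and it costs one extra addition per term.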