A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function and predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
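To make the role of k1 concrete, the following is a minimal sketch of the widely used BM25 TF component, tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avdl)), together with a per-term k1 lookup. The function names, the default values k1 = 1.2 and b = 0.75, and the `TERM_K1` table are assumptions made for illustration. The per-term values are invented and are not produced by the information gain estimation described in the abstract.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Standard BM25 TF component.

    The value grows sublinearly with tf and approaches k1 + 1,
    so k1 sets both the scale and the shape of the saturation curve.
    """
    if tf <= 0:
        return 0.0
    # Document length normalization; equals k1 when doc_len == avg_doc_len.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Hypothetical per-term k1 values, for illustration only.
TERM_K1 = {"retrieval": 0.6, "poisson": 2.0}


def term_specific_tf(term, tf, default_k1=1.2, **kwargs):
    """Apply the BM25 TF component with a k1 looked up per term.

    Terms missing from TERM_K1 fall back to the term-independent default.
    """
    return bm25_tf(tf, k1=TERM_K1.get(term, default_k1), **kwargs)


if __name__ == "__main__":
    for tf in (1, 2, 5, 20):
        print(tf,
              round(term_specific_tf("retrieval", tf), 3),
              round(term_specific_tf("poisson", tf), 3))
```

With a small k1 the curve saturates after only a few occurrences, while a larger k1 keeps rewarding repeated occurrences for longer. This is the behavior that a term-specific setting is meant to exploit.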