Probabilistic document length priors for language models

  • Authors:
  • Roi Blanco;Alvaro Barreiro

  • Affiliations:
  • IRLab., Computer Science Department, University of A Coruña, Spain;IRLab., Computer Science Department, University of A Coruña, Spain

  • Venue:
  • ECIR'08: Proceedings of the 30th European Conference on IR Research, Advances in Information Retrieval
  • Year:
  • 2008

Abstract

This paper addresses the problem of devising a new document prior for the language modeling (LM) approach to Information Retrieval. The prior is based on term statistics, is derived in a probabilistic fashion, and offers a novel way of accounting for document length. Furthermore, we develop a new way of combining document length priors with the query likelihood estimate, based on the risk of accepting the latter as a score. The prior is combined with a document retrieval language model that uses Jelinek-Mercer (JM) smoothing, a technique that does not take document length into account. Adding the prior boosts retrieval performance, so that the combined model outperforms an LM with a document-length-dependent smoothing component (Dirichlet prior) and another state-of-the-art, high-performing scoring function (BM25). The improvements are significant and robust across different collections and query sizes.
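
To make the setting concrete, the following is a minimal sketch of query likelihood scoring with Jelinek-Mercer smoothing plus a document prior added in log space. It is not the prior or the risk-based combination proposed in the paper, neither of which is specified in the abstract. The function names, the convention that `lam` weights the collection model, and the length-proportional prior are all assumptions made for illustration only.

```python
import math
from collections import Counter


def jm_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Log query likelihood under Jelinek-Mercer smoothing.

    Assumed convention: lam weights the collection model,
    (1 - lam) weights the document model.
    """
    tf = Counter(doc)
    dl = len(doc)
    total = 0.0
    for term in query:
        p_doc = tf[term] / dl if dl else 0.0
        p_coll = coll_tf[term] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term never occurs in the collection: skip it.
            continue
        total += math.log(p)
    return total


def length_log_prior(doc, coll_len):
    """Hypothetical placeholder prior: P(d) proportional to document length."""
    return math.log(len(doc) / coll_len)


def score(query, doc, coll_tf, coll_len, lam=0.5):
    """Log prior plus JM-smoothed log likelihood (plain Bayesian combination)."""
    return length_log_prior(doc, coll_len) + jm_log_likelihood(
        query, doc, coll_tf, coll_len, lam
    )


# Toy collection: the long document is the short one repeated twice,
# so both have identical relative term frequencies.
d_short = ["language", "model", "language"]
d_long = d_short * 2
collection = d_short + d_long + ["retrieval", "ranking"]
coll_tf = Counter(collection)
coll_len = len(collection)
query = ["language", "model"]

for name, d in [("short", d_short), ("long", d_long)]:
    print(name,
          jm_log_likelihood(query, d, coll_tf, coll_len),
          score(query, d, coll_tf, coll_len))
```

In this toy collection the two documents receive the same JM likelihood because the smoothed estimate depends only on relative term frequencies, which is the sense in which JM smoothing ignores document length. Only the prior term separates them, and under the placeholder prior assumed here the longer document ranks higher.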