An analysis on document length retrieval trends in language modeling smoothing

Authors:
David E. Losada; Leif Azzopardi
Affiliations:
Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, Santiago, Spain; Department of Computing Science, University of Glasgow, Glasgow, Scotland
Venue:
Information Retrieval
Year:
2008



Abstract

Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the retrieval of either short or long documents, so a length-based correction needs to be applied to avoid length bias. In Language Modeling for Information Retrieval, smoothing methods move probability mass from document terms to unseen words, and the amount moved often depends on document length. In this article, we perform an in-depth study of this behavior, characterized by the document length retrieval trends, for three popular smoothing methods across a number of factors, and of its impact on the length of documents retrieved and on retrieval performance. First, we theoretically analyze the Jelinek-Mercer, Dirichlet prior and two-stage smoothing strategies and then conduct an empirical analysis. Our analysis shows that Dirichlet prior smoothing caters for document length more appropriately than Jelinek-Mercer smoothing, which leads to its superior retrieval performance. In a follow-up analysis, we posit that length-based priors can be used to offset any bias in the length retrieval trends stemming from the retrieval formula derived from the smoothing technique. We show that the performance of Jelinek-Mercer smoothing can be significantly improved by using such a prior, which provides a natural and simple way to decouple the query and document modeling roles of smoothing. The analysis of retrieval behavior conducted in this article makes it possible to understand why Dirichlet prior smoothing performs better than Jelinek-Mercer smoothing, and why the performance of the Jelinek-Mercer method is improved by including a length-based prior.
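
As a rough illustration of the two smoothing methods named in the abstract, the sketch below uses the standard textbook forms of Jelinek-Mercer and Dirichlet prior smoothing. It is not taken from the article: the function names, the default values of `lam` and `mu`, and the toy data are all hypothetical. The optional prior proportional to document length is an assumption chosen for simplicity, not necessarily the prior the authors study.

```python
import math
from collections import Counter


def jelinek_mercer(tf, doc_len, p_coll, lam=0.5):
    # Fixed interpolation: the collection model always gets weight lam,
    # regardless of how long the document is.
    return (1.0 - lam) * tf / doc_len + lam * p_coll


def dirichlet(tf, doc_len, p_coll, mu=1000.0):
    # The collection model gets weight mu / (doc_len + mu), so longer
    # documents are smoothed less than shorter ones.
    return (tf + mu * p_coll) / (doc_len + mu)


def dirichlet_collection_weight(doc_len, mu=1000.0):
    # Share of probability mass taken from the collection model.
    return mu / (doc_len + mu)


def query_log_likelihood(query, doc, coll_model, smooth, length_prior=False):
    # Sum of log smoothed term probabilities over the query terms.
    counts = Counter(doc)
    doc_len = len(doc)
    score = sum(math.log(smooth(counts[w], doc_len, coll_model[w])) for w in query)
    if length_prior:
        # Assumption: document prior proportional to document length.
        score += math.log(doc_len)
    return score


# Hypothetical toy data, for illustration only.
coll_model = {"a": 0.1, "b": 0.2, "c": 0.05, "d": 0.01}
doc = ["a", "b", "a", "c"]
query = ["a", "d"]

jm_score = query_log_likelihood(query, doc, coll_model, jelinek_mercer)
jm_prior_score = query_log_likelihood(
    query, doc, coll_model, jelinek_mercer, length_prior=True
)
dir_score = query_log_likelihood(query, doc, coll_model, dirichlet)
```

Under these assumptions, the Jelinek-Mercer weight on the collection model stays at `lam` for every document, while `dirichlet_collection_weight` shrinks as `doc_len` grows. That difference is one way to see why the two methods can show different document length retrieval trends, and why adding a length-dependent prior changes the Jelinek-Mercer ranking.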