Assessing multivariate Bernoulli models for information retrieval

Authors:
David E. Losada;Leif Azzopardi
Affiliations:
Universidad de Santiago de Compostela, Spain;University of Glasgow, Scotland, UK
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2008

Abstract

Although the seminal proposal to introduce language modeling in information retrieval was based on a multivariate Bernoulli model, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and is believed to be less effective than the multinomial model. In this article, we examine the multivariate Bernoulli model with respect to its successor and consider its role in future retrieval systems. In the context of Bayesian learning, these two modeling approaches are described, contrasted, and compared both theoretically and computationally. We show that the query likelihood following a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. Then, we address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before performing an empirical comparison to study the behavioral aspects of the models. A series of comparisons is then conducted on a number of test collections and retrieval tasks to determine the empirical and practical differences between the models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. However, for the other tasks the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the various retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose nonquery terms are informative. While this is detrimental to the task of document retrieval (documents tend to contain considerable nonquery content), it is valuable for other tasks such as sentence retrieval, where the retrieved elements are very short and focused.
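
To make the contrast between the two query likelihoods concrete, the following minimal sketch scores a query under a textbook form of each model. It is an illustration under stated assumptions, not the article's implementation: the function names, the toy documents, and the linear interpolation with collection statistics (weight `lam`) are all hypothetical choices, and the article itself works in a Bayesian learning setting whose estimators may differ. The point it shows is structural: the multinomial score uses only the query terms, whereas the multivariate Bernoulli score also includes a factor for every vocabulary term that is absent from the query.

```python
import math
from collections import Counter


def smoothed_probs(doc_tokens, collection_tokens, lam=0.5):
    """Hypothetical estimator: mix document and collection relative frequencies.

    Assumes the document is non-empty and all its tokens occur in the collection.
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    doc_len = len(doc_tokens)
    coll_len = len(collection_tokens)
    return {
        term: (1 - lam) * doc_counts[term] / doc_len
        + lam * coll_counts[term] / coll_len
        for term in coll_counts
    }


def multinomial_log_likelihood(query_tokens, probs):
    """Sum of log p(t|d) over query tokens; repeated query terms count each time."""
    return sum(math.log(probs[t]) for t in query_tokens)


def bernoulli_log_likelihood(query_tokens, probs):
    """Log p for terms present in the query, log (1 - p) for every other vocabulary term."""
    query_terms = set(query_tokens)  # only presence or absence matters
    score = 0.0
    for term, p in probs.items():
        score += math.log(p) if term in query_terms else math.log(1.0 - p)
    return score


# Toy data (assumed for illustration only).
docs = [
    ["cheap", "flights", "to", "rome"],
    ["rome", "history", "and", "culture", "of", "rome"],
]
collection = [t for d in docs for t in d]
query = ["rome", "flights"]

for i, doc in enumerate(docs):
    probs = smoothed_probs(doc, collection)
    print(
        i,
        round(multinomial_log_likelihood(query, probs), 3),
        round(bernoulli_log_likelihood(query, probs), 3),
    )
```

Because the Bernoulli score sums over the whole vocabulary, a naive implementation such as this one costs time proportional to the vocabulary size per document, which is the kind of inefficiency the abstract says can be avoided with suitably designed algorithms.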