A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution

Authors:
Zuobing Xu;Ram Akella
Affiliations:
University of California, Santa Cruz, Santa Cruz, CA, USA;University of California, Santa Cruz, Santa Cruz, CA, USA
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008

Abstract

Classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to effective performance of these models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution, which arises from hierarchical Bayesian modeling or, equivalently, from the Pólya urn scheme, is a more appropriate generative model for text documents than the traditional multinomial distribution. We explore a new probabilistic model based on the DCM distribution that enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency among repeated occurrences of a word, the new model represents the concavity of the score function more effectively. To avoid empirical tuning of retrieval parameters, we design several parameter estimation algorithms that set the model parameters automatically. Additionally, we propose a pseudo-relevance feedback algorithm based on latent mixture modeling of the DCM distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline DCM-based retrieval algorithm and the corresponding pseudo-relevance feedback algorithm outperform existing language modeling systems on several TREC retrieval tasks.
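
As a rough illustration of the distribution the abstract refers to, the sketch below evaluates the standard DCM log-probability of a word-count vector and the Pólya urn predictive probability of seeing a word again. It is not the retrieval model or the estimation algorithms of the paper; the function names and the parameter values are hypothetical and chosen only for the example.

```python
from math import lgamma

def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM with parameters alpha.

    Standard form: n!/prod(x_w!) * Gamma(A)/Gamma(n+A)
                   * prod(Gamma(x_w + alpha_w)/Gamma(alpha_w)),
    where n = sum of counts and A = sum of alpha.
    """
    n = sum(counts)
    a = sum(alpha)
    # log of the multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log of Gamma(A) / Gamma(n + A)
    log_norm = lgamma(a) - lgamma(n + a)
    # log of prod Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms

def polya_next_prob(counts, alpha, w):
    """Pólya urn predictive probability that the next token is word w,
    given the counts observed so far."""
    return (alpha[w] + counts[w]) / (sum(alpha) + sum(counts))

# Hypothetical two-word vocabulary with illustrative parameters.
alpha = [1.0, 1.0]
before = polya_next_prob([0, 0], alpha, 0)  # no occurrences of word 0 yet
after = polya_next_prob([2, 0], alpha, 0)   # word 0 already seen twice
print(before, after)
```

Under the urn view, each occurrence of a word raises the probability of seeing that word again, which is the dependency among repeated occurrences that a plain multinomial cannot express.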