Clusters, language models, and ad hoc information retrieval

Authors:
Oren Kurland;Lillian Lee
Affiliations:
Technion—Israel Institute of Technology, Haifa, Israel;Cornell University, Ithaca, NY
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2009

Citing 42
Cited 4

Comparison of hierarchic agglomerative clustering methods for document retrieval

The Computer Journal
Class-based n-gram models of natural language

Computational Linguistics
Reexamining the cluster hypothesis: scatter/gather on retrieval results

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
The cluster hypothesis revisited

SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Using interdocument similarity information in document retrieval systems

Readings in information retrieval
Exploring the similarity space

ACM SIGIR Forum
Improving automatic query expansion

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Document expansion for speech retrieval

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A hidden Markov model information retrieval system

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Cluster-based language models for distributed retrieval

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A general language model for information retrieval (poster abstract)

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised learning by probabilistic latent semantic analysis

Machine Learning
Document language models, query models, and risk minimization for information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Relevance based language models

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating document clustering for interactive information retrieval

Proceedings of the tenth international conference on Information and knowledge management
Information Retrieval

Information Retrieval
Two-stage language models for information retrieval

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Passage retrieval based on language models

Proceedings of the eleventh international conference on Information and knowledge management
The effectiveness of query-specific hierarchic clustering in information retrieval

Information Processing and Management: an International Journal
Optimal Mixture Models in IR

Proceedings of the 24th BCS-IRSG European Colloquium on IR Research: Advances in Information Retrieval
Error analysis of difficult TREC topics

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Language Modeling for Information Retrieval

Language Modeling for Information Retrieval
Latent dirichlet allocation

The Journal of Machine Learning Research
A survey on the use of relevance feedback for information access systems

The Knowledge Engineering Review
Distributional clustering of English words

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Cluster-based retrieval using language models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Corpus structure, language models, and ad hoc information retrieval

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Better than the real thing?: iterative pseudo-query processing using cluster-based language models

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
PageRank without hyperlinks: structural re-ranking using links induced by language models

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Re-ranking method based on inter-document distances

Information Processing and Management: an International Journal
Regularizing ad hoc retrieval scores

Proceedings of the 14th ACM international conference on Information and knowledge management
Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Improving the estimation of relevance models using large external corpora

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
LDA-based document models for ad-hoc retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Representing clusters for retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Language model information retrieval with document expansion

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Inter-document similarities, language models, and ad hoc information retrieval

Inter-document similarities, language models, and ad hoc information retrieval
Relevance models for topic detection and tracking

HLT '02 Proceedings of the second international conference on Human Language Technology Research
Regularizing query-based retrieval scores

Information Retrieval

Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives

ACM Transactions on Information Systems (TOIS)
Improving retrieval of short texts through document expansion

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Exploring the cluster hypothesis, and cluster-based retrieval, over the web

Proceedings of the 21st ACM international conference on Information and knowledge management
A novel neighborhood based document smoothing model for information retrieval

Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

The language-modeling approach to information retrieval provides an effective statistical framework for tackling various problems and often achieves impressive empirical performance. However, most previous work on language models for information retrieval focused on document-specific characteristics, and therefore did not take into account the structure of the surrounding corpus, a potentially rich source of additional information. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in terms of mean average precision (MAP) and recall, and our new interpolation algorithm posts statistically significant performance improvements for both metrics over all six corpora tested. An important aspect of our work is the way we model corpus structure. In contrast to most previous work on cluster-based retrieval that partitions the corpus, we demonstrate the effectiveness of a simple strategy based on a nearest-neighbors approach that produces overlapping clusters.