Inter-document similarities, language models, and ad hoc information retrieval

  • Authors: Lillian Lee; Oren Kurland

  • Affiliations: Cornell University; Cornell University

  • Venue: Inter-document similarities, language models, and ad hoc information retrieval

  • Year: 2006

Abstract

Search engines have become a crucial tool for finding information in repositories containing large amounts of textual data in unstructured form (e.g., the Web). However, the task of ad hoc information retrieval, that is, finding documents within a corpus that are relevant to an information need specified using a query, remains a hard challenge. The language modeling approach to information retrieval provides an effective framework for approaching various problems and has yielded impressive empirical performance. However, most previous work on language models for information retrieval focuses on document-specific characteristics to estimate documents' language models, and therefore does not take into account the structure of the surrounding corpus, a potentially rich source of additional information.

We present a novel perspective for approaching the task of ad hoc retrieval: information provided by document-based language models can be enhanced by the incorporation of information drawn from clusters of similar documents that are created offline. We present several retrieval algorithms that are natural instantiations of this idea and that post performance that is substantially better than that of the standard language modeling approach. We also show that the best performing of these algorithms posts state-of-the-art performance for structural re-ranking, wherein an initially retrieved subset of the documents is re-ranked to obtain high precision specifically among the first few documents, using inter-document similarities within the list as an extra information source.

As further exploration of the re-ranking approach just described, and inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a graph-based framework that applies to document collections lacking hyperlink information. Specifically, centrality induced over graphs wherein links represent asymmetric language-model-based inter-document similarities constitutes the basis of effective re-ranking algorithms. Combining our two paradigms for similarity representation (clusters of documents, and links representing language-model-based inter-item similarities) helps to improve the effectiveness of centrality-based approaches. For example, document "authoritativeness" as induced by the HITS algorithm over cluster-document graphs is a highly effective re-ranking criterion. Furthermore, "authoritative" clusters are shown to contain a high percentage of relevant documents.
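
To illustrate the last idea, the following is a minimal sketch of HITS-style hub and authority scoring over a bipartite cluster-document graph, with clusters acting as hubs and documents as authorities. It is not the authors' implementation: the function names `hits_bipartite` and `normalize`, the iteration count, and the toy edge weights are all assumptions made for illustration. In the work described, edge weights would come from language-model-based similarities; here they are simply made-up numbers.

```python
from math import sqrt


def normalize(scores):
    """Scale a score vector to unit Euclidean length (no-op for all zeros)."""
    norm = sqrt(sum(s * s for s in scores))
    return scores if norm == 0 else [s / norm for s in scores]


def hits_bipartite(weights, iterations=50):
    """Run HITS on a bipartite graph with edges from clusters to documents.

    weights[c][d] is the nonnegative weight of the edge from cluster c to
    document d (hypothetical values; a real system would derive them from
    similarity estimates). Returns (hub scores, authority scores).
    """
    n_clusters = len(weights)
    n_docs = len(weights[0])
    hubs = [1.0] * n_clusters
    auths = [1.0] * n_docs
    for _ in range(iterations):
        # A document is authoritative if strong hub clusters point to it.
        auths = normalize([
            sum(weights[c][d] * hubs[c] for c in range(n_clusters))
            for d in range(n_docs)
        ])
        # A cluster is a good hub if it points to authoritative documents.
        hubs = normalize([
            sum(weights[c][d] * auths[d] for d in range(n_docs))
            for c in range(n_clusters)
        ])
    return hubs, auths


# Toy graph (assumed data): 2 clusters, 3 documents.
weights = [
    [0.9, 0.8, 0.0],  # cluster 0 links strongly to documents 0 and 1
    [0.0, 0.1, 0.7],  # cluster 1 links mostly to document 2
]
hubs, auths = hits_bipartite(weights)

# Re-rank documents by descending authority score.
ranking = sorted(range(len(auths)), key=lambda d: auths[d], reverse=True)
print(ranking)
```

In a re-ranking setting, `ranking` would replace the order of the initially retrieved list, and the hub scores in `hubs` would indicate which clusters are the strongest hubs for that list.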