Less is more: probabilistic models for retrieving fewer relevant documents

  • Authors:
  • Harr Chen;David R. Karger

  • Affiliations:
  • MIT CSAIL, Cambridge, MA;MIT CSAIL, Cambridge, MA

  • Venue:
  • SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Traditionally, information retrieval systems aim to maximize thenumber of relevant documents returned to a user within some windowof the top. For that goal, the probability ranking principle, whichranks documents in decreasing order of probability of relevance, isprovably optimal. However, there are many scenarios in which thatranking does not optimize for the users information need. Oneexample is when the user would be satisfied with some limitednumber of relevant documents, rather than needing all relevantdocuments. We show that in such a scenario, an attempt to returnmany relevant documents can actually reduce the chances of findingany relevant documents.We consider a number of information retrieval metrics from theliterature, including the rank of the first relevant result, the%no metric that penalizes a system only for retrieving no relevantresults near the top, and the diversity of retrieved results whenqueries have multiple interpretations. We observe that given aprobabilistic model of relevance, it is appropriate to rank so asto directly optimize these metrics in expectation. While doing somay be computationally intractable, we show that a simple greedyoptimization algorithm that approximately optimizes the givenobjectives produces rankings for TREC queries that outperform thestandard approach based on the probability ranking principle.