Query-based inter-document similarity using probabilistic co-relevance model

  • Authors:
  • Seung-Hoon Na; In-Su Kang; Jong-Hyeok Lee

  • Affiliations:
  • POSTECH, Pohang, South Korea; KISTI, Daejeon, South Korea; POSTECH, Pohang, South Korea

  • Venue:
  • ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
  • Year:
  • 2008

Abstract

Inter-document similarity is the critical information that determines whether cluster-based retrieval improves on the baseline. However, no theoretical work on inter-document similarity has been carried out, even though such work could provide a principle for defining an improved similarity in a well-motivated direction. To provide such a theoretical basis, this paper starts by pursuing an ideal inter-document similarity that optimally satisfies the cluster hypothesis. We propose a probabilistic principle of inter-document similarity: the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of inter-document similarity is formulated as the problem of estimating the co-relevance model of documents. Furthermore, we find that the optimal inter-document similarity should be defined using queries, not terms, as its basic unit, namely as a query-based similarity. We rigorously derive a novel query-based similarity from the co-relevance model, without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves on the previously used term-based similarity in the context of Voorhees' evaluation measure.
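
The co-relevance principle stated in the abstract can be illustrated with a small sketch. It is not the estimator derived in the paper, which the abstract does not give. It assumes a finite set of queries with a known prior, and it assumes that the relevance of two documents is independent given the query, so that the co-relevance probability becomes a sum over queries of P(q) · P(rel | d_i, q) · P(rel | d_j, q). The function name `co_relevance_similarity` and all numbers are hypothetical.

```python
from typing import Sequence


def co_relevance_similarity(
    rel_i: Sequence[float],
    rel_j: Sequence[float],
    query_prior: Sequence[float],
) -> float:
    """Probability that two documents are co-relevant to a random query.

    rel_i[k] and rel_j[k] are assumed relevance probabilities of each
    document for query k; query_prior[k] is the assumed probability of
    query k. Relevance is treated as independent given the query, which
    is an assumption of this sketch, not a claim about the paper.
    """
    if not (len(rel_i) == len(rel_j) == len(query_prior)):
        raise ValueError("all inputs must cover the same queries")
    return sum(p * a * b for p, a, b in zip(query_prior, rel_i, rel_j))


# Toy data: three hypothetical queries with a uniform prior.
query_prior = [1 / 3, 1 / 3, 1 / 3]
doc_a = [0.9, 0.1, 0.0]  # mostly relevant to query 0
doc_b = [0.8, 0.2, 0.0]  # also mostly relevant to query 0
doc_c = [0.0, 0.1, 0.9]  # mostly relevant to query 2

sim_ab = co_relevance_similarity(doc_a, doc_b, query_prior)
sim_ac = co_relevance_similarity(doc_a, doc_c, query_prior)
```

In this toy setting, documents A and B satisfy the same query and so receive a higher similarity than A and C, regardless of which terms they share. This is the sense in which the similarity uses queries, not terms, as its basic unit.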