Word distribution analysis for relevance ranking and query expansion

  • Authors:
  • Patricio Galeas;Bernd Freisleben

  • Affiliations:
  • Dept. of Mathematics and Computer Science, University of Marburg, Marburg, Germany;Dept. of Mathematics and Computer Science, University of Marburg, Marburg, Germany

  • Venue:
  • CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Apart from the frequency of terms in a document collection, the distribution of words plays an important role in determining the relevance of documents for a given search query. In this paper, word distribution analysis as a novel approach for using descriptive statistics to calculate a compressed representation of word positions in a document corpus is introduced. Based on this statistical approximation, two methods for improving the evaluation of document relevance are proposed: (a) a relevance ranking procedure based on how query terms are distributed over initially retrieved documents, and (b) a query expansion technique based on overlapping the distributions of terms in the top-ranked documents. Experimental results obtained for the TREC-8 document collection demonstrate that the proposed approach leads to an improvement of about 6.6% over the term frequency/inverse document frequency weighting scheme without applying query reformulation or relevance feedback techniques.