Document Length Normalization by Statistical Regression

  • Authors:
  • Sylvain Lamprier;Tassadit Amghar;Bernard Levrat;Frederic Saubion

  • Venue:
  • ICTAI '07 Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence - Volume 02
  • Year:
  • 2007

Abstract

The document-length normalization problem has been widely studied in the field of Information Retrieval. Cosine Normalization [2], Maximum tf Normalization [1] and Byte Length Normalization [12] are the most commonly used normalization techniques. In [14], the authors studied the retrieval probability of documents with respect to their size, using different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show here that document and query sizes do strongly influence the expected similarity score. We therefore propose a statistical regression of the similarity score distribution with respect to document and query sizes in order to normalize the scores. Experimental results suggest that our approach judges similarities more fairly, both in classical Information Retrieval and when applied to a document clustering process.
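
The idea of regressing similarity scores on document and query sizes can be illustrated with a minimal sketch. The abstract does not specify the regression model or the normalization formula, so everything below is an assumption made for illustration: the expected score is modeled as a linear function of the logarithms of the two lengths, fitted by ordinary least squares, and a score is normalized by subtracting that expected value. The function names (fit_expected_score, expected_score, normalize_score) are hypothetical and do not come from the paper.

```python
import math


def _solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_expected_score(samples):
    """Fit score ~ a + b*log(doc_len) + c*log(query_len) by least squares.

    `samples` is a list of (doc_len, query_len, score) tuples.
    The log-linear form is an assumption, not the model used in the paper.
    """
    rows = [(1.0, math.log(d), math.log(q)) for d, q, _ in samples]
    scores = [s for _, _, s in samples]
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(3)]
    return _solve(ata, aty)


def expected_score(weights, doc_len, query_len):
    """Score predicted by the fitted model for the given sizes."""
    a, b, c = weights
    return a + b * math.log(doc_len) + c * math.log(query_len)


def normalize_score(weights, score, doc_len, query_len):
    """Residual of a raw score with respect to its size-based expectation.

    Subtracting the expectation is one possible normalization (assumed here).
    """
    return score - expected_score(weights, doc_len, query_len)
```

Under these assumptions, a document scores well after normalization only if its raw similarity exceeds what its length and the query length alone would predict. A ratio to the expected score would be another plausible choice, but it is less stable when the expectation is close to zero.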