Document Length Normalization by Statistical Regression

Authors:
Sylvain Lamprier;Tassadit Amghar;Bernard Levrat;Frederic Saubion
Venue:
ICTAI '07 Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence - Volume 02
Year:
2007

Citing: 0
Cited by: 2

Thematic Segment Retrieval Revisited
AIMSA '08 Proceedings of the 13th international conference on Artificial Intelligence: Methodology, Systems, and Applications

Enhancing ad-hoc relevance weighting using probability density estimation
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Abstract

The document-length normalization problem has been widely studied in the field of Information Retrieval. The Cosine Normalization [2], the Maximum tf Normalization [1] and the Byte Length Normalization [12] are the most commonly used normalization techniques. In [14], the authors studied the retrieval probability of documents with respect to their size under different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show here that document and query sizes do strongly influence the expected similarity score. We therefore propose a statistical regression of the similarity score distribution on document and query sizes in order to normalize the scores. Experimental results suggest that our approach yields fairer similarity judgments, both in classical Information Retrieval and when applied to a document clustering process.
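
The general idea of normalizing scores by regression can be illustrated with a minimal sketch. The abstract does not specify the regression model, so everything here is an assumption made for illustration: a linear least-squares fit of the score on the logarithms of document and query length, the hypothetical function names `fit_expected_score` and `normalize_scores`, and the choice to normalize by subtracting the expected score. It should not be read as the authors' actual method.

```python
import numpy as np


def _design_matrix(doc_lens, query_lens):
    # Assumed model: score ~ a + b*log(doc_len) + c*log(query_len).
    doc_lens = np.asarray(doc_lens, dtype=float)
    query_lens = np.asarray(query_lens, dtype=float)
    return np.column_stack([
        np.ones(doc_lens.shape[0]),
        np.log(doc_lens),
        np.log(query_lens),
    ])


def fit_expected_score(doc_lens, query_lens, scores):
    """Least-squares estimate of the score expected from sizes alone."""
    X = _design_matrix(doc_lens, query_lens)
    y = np.asarray(scores, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def normalize_scores(coef, doc_lens, query_lens, scores):
    """Return each score minus the score predicted from the sizes."""
    X = _design_matrix(doc_lens, query_lens)
    return np.asarray(scores, dtype=float) - X @ coef
```

Under these assumptions, the coefficients would be fitted once on a sample of (document length, query length, score) triples, and ranking would then use the normalized scores, which by construction are uncorrelated with the size terms in the fitting sample.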