Extreme value theory applied to document retrieval from large collections

Authors:
David Madigan;Yehuda Vardi;Ishay Weissman
Affiliations:
Avaya Labs, USA;Rutgers University, USA;Technion, Israel
Venue:
Information Retrieval
Year:
2006

Citing 5

Scaling Up the TREC Collection. Information Retrieval.
The score-distributional threshold optimization for adaptive binary classification tasks. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval.
On Collection Size and Retrieval Effectiveness. Information Retrieval.
Using asymmetric distributions to improve text classifier probability estimates. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval.
Relevant document distribution estimation method for resource selection. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval.

Cited 3

Investigating performance predictors using monte carlo simulation and score distribution models. SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval.
Modelling Score Distributions Without Actual Scores. Proceedings of the 2013 Conference on the Theory of Information Retrieval.
Document Score Distribution Models for Query Performance Inference and Prediction. ACM Transactions on Information Systems (TOIS).

Abstract

We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications are a primary focus of the annual Text Retrieval Conference (TREC), where participants compare the empirical performance of different approaches. P(K), the proportion of the top K documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that when the target is a random sample from a collection, P(K) is substantially smaller than when the target is the entire collection. Hawking and Robertson (2003) confirmed this finding in a number of experimental settings. Hawking et al. (1999) posed the cause of this phenomenon as an open research question and proposed five possible explanatory hypotheses. In this paper, we present a mathematical analysis that sheds some light on these hypotheses and complements the experimental work of Hawking and Robertson (2003). We also introduce C(L), contamination at L, the number of irrelevant documents amongst the top L relevant documents, and describe its properties. Our analysis shows that while P(K) typically will increase with collection size, the phenomenon is not universal. That is, the asymptotic behavior of P(K) and C(L) depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection.
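
The two measures named in the abstract can be made concrete with a minimal sketch. It is an illustration only, not code from the paper: the function names, the toy scores, and the relevance labels are hypothetical, and C(L) is read here as the number of irrelevant documents ranked above the L-th relevant document, which is one interpretation of the abstract's wording.

```python
def rank_by_score(scores, relevant):
    """Return the relevance labels ordered from highest to lowest score."""
    pairs = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    return [is_relevant for _, is_relevant in pairs]


def precision_at_k(scores, relevant, k):
    """P(K): proportion of the top K documents that are relevant."""
    top_k = rank_by_score(scores, relevant)[:k]
    return sum(1 for is_relevant in top_k if is_relevant) / k


def contamination_at_l(scores, relevant, l):
    """C(L), under the assumed reading: irrelevant documents ranked
    above the L-th relevant document."""
    relevant_seen = 0
    irrelevant_seen = 0
    for is_relevant in rank_by_score(scores, relevant):
        if is_relevant:
            relevant_seen += 1
            if relevant_seen == l:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    raise ValueError("the collection holds fewer than l relevant documents")


# Hypothetical toy collection: six documents with distinct scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_relevant = [True, False, True, False, False, True]

print(precision_at_k(toy_scores, toy_relevant, 2))
print(contamination_at_l(toy_scores, toy_relevant, 2))
```

Ranking once and reading both measures from the same ordering reflects that P(K) fixes the number of retrieved documents while C(L) fixes the number of relevant documents retrieved.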