Modern retrieval test collections are built through a process called pooling, in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents that, when unjudged documents are assumed to be nonrelevant, the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling, when the pools are too small relative to the total document set size, can be biased in that they favor relevant documents that contain topic title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this bias in its relevance judgments; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
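The pooling process described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the depth-k pooling variant, and the toy run data are all assumptions for the sake of the example. It shows the key point the abstract makes, namely that any relevant document outside the pool is scored as nonrelevant.

```python
# Illustrative sketch of depth-k pooling (names and data are hypothetical).
# A "run" is one system's ranked list of document IDs for a topic.

def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from every contributing run.
    Only documents in this pool are shown to assessors for judging."""
    pool = set()
    for run in runs:
        pool.update(run[:depth])
    return pool

def precision_at_k(ranked_list, judged_relevant, pool, k=10):
    """Precision@k under the pooling assumption: a document that was
    never pooled (hence never judged) is treated as nonrelevant."""
    top_k = ranked_list[:k]
    hits = sum(1 for doc in top_k if doc in pool and doc in judged_relevant)
    return hits / k

# Two toy runs pooled to depth 2.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
pool = build_pool(runs, depth=2)  # {"d1", "d2", "d4"}

# "d3" is relevant but fell outside the pool, so a new system that
# retrieves it gets no credit -- the bias the paper is concerned with.
score = precision_at_k(["d2", "d3"], {"d2", "d3"}, pool, k=2)
print(score)  # 0.5, even though both retrieved documents are relevant
```

The example makes the abstract's failure mode concrete: as the document set grows while the pool depth stays fixed, more relevant documents land outside the pool, and systems that find them are penalized as if those documents were nonrelevant.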