Modern retrieval test collections are built through a process called pooling, in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents that, when unjudged documents are assumed to be nonrelevant, the resulting judgment set is sufficiently complete and unbiased. As document sets grow larger, a constant-size pool represents an increasingly small percentage of the document set, and at some point the assumption of approximately complete judgments must become invalid.

This paper demonstrates that the AQUAINT 2005 test collection exhibits bias caused by pools that were too shallow for the size of the document set, despite the many diverse runs that contributed to the pools. The existing judgment set favors relevant documents that contain topic title words, even though relevant documents containing few topic title words are known to exist in the document set. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
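The abstract does not include code, but the pooling process it describes is easy to make concrete. The sketch below shows standard depth-k pooling under assumed data structures (a "run" as a mapping from topic id to a ranked list of document ids); the names are illustrative, not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical representation: one run maps each topic id to a ranked
# list of document ids, best first.
Run = Dict[str, List[str]]

def build_pool(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Depth-k pooling: for each topic, union the top-`depth` documents
    from every contributing run. Only pooled documents are judged;
    everything outside the pool is assumed nonrelevant at evaluation time."""
    pool: Dict[str, Set[str]] = {}
    for run in runs:
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool
```

The scaling problem the abstract raises falls out of this construction: a topic's pool holds at most `len(runs) * depth` documents regardless of collection size, so the judged fraction of the collection shrinks as the collection grows. Similarly, one crude way to probe the title-word bias the paper reports is to compare lexical overlap with the topic title between judged-relevant documents and relevant documents found outside the pool; the helper below is a hypothetical illustration of such a check, not the paper's actual methodology.

```python
def title_word_fraction(doc_text: str, title_words: Set[str]) -> float:
    """Fraction of a topic's title words appearing in a document; a rough
    proxy for the lexical overlap the paper examines."""
    tokens = set(doc_text.lower().split())
    return len(title_words & tokens) / len(title_words) if title_words else 0.0
```

A judgment set exhibiting the reported bias would show systematically higher values of this fraction among pooled relevant documents than among relevant documents the pool missed.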