Modern retrieval test collections are built through a process called pooling, in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents that, when unjudged documents are assumed to be nonrelevant, the resulting judgment set is sufficiently complete and unbiased. As document sets grow larger, a constant-size pool represents an increasingly small percentage of the document set, and at some point the assumption of approximately complete judgments must become invalid.

This paper demonstrates that the AQUAINT 2005 test collection exhibits bias caused by pools that were too shallow for the size of the document set, despite the many diverse runs that contributed to the pools. The existing judgment set favors relevant documents that contain topic title words, even though relevant documents containing few topic title words are known to exist in the document set. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
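The abstract does not include code, but the pooling process it describes is easy to make concrete. The sketch below shows standard depth-k pooling under assumed data structures (a "run" as a mapping from topic id to a ranked list of document ids); the names are illustrative, not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical representation: one run maps each topic id to a ranked
# list of document ids, best first.
Run = Dict[str, List[str]]

def build_pool(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Depth-k pooling: for each topic, union the top-`depth` documents
    from every contributing run. Only pooled documents are judged;
    everything outside the pool is assumed nonrelevant at evaluation time."""
    pool: Dict[str, Set[str]] = {}
    for run in runs:
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool
```

The scaling problem the abstract raises falls out of this construction: a topic's pool holds at most `len(runs) * depth` documents regardless of collection size, so the judged fraction of the collection shrinks as the collection grows. Similarly, one crude way to probe the title-word bias the paper reports is to compare lexical overlap with the topic title between judged-relevant documents and relevant documents found outside the pool; the helper below is a hypothetical illustration of such a check, not the paper's actual methodology.

```python
def title_word_fraction(doc_text: str, title_words: Set[str]) -> float:
    """Fraction of a topic's title words appearing in a document; a rough
    proxy for the lexical overlap the paper examines."""
    tokens = set(doc_text.lower().split())
    return len(title_words & tokens) / len(title_words) if title_words else 0.0
```

A judgment set exhibiting the reported bias would show systematically higher values of this fraction among pooled relevant documents than among relevant documents the pool missed.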