Reusable test collections through experimental design
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.
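To make the evaluation-with-incomplete-judgments idea concrete, the sketch below shows one simple way a metric can be computed when only a handful of documents per topic are judged: known judgments are used where they exist, and unjudged documents contribute an estimated probability of relevance, so the score becomes an expectation. This is only an illustrative sketch, not the estimator proposed in the paper; the function name, variable names, and the choice of expected precision@k are all assumptions made for the example.

```python
# Illustrative sketch (not the paper's estimator): expected precision@k for a
# ranked run when only a few documents per topic carry human judgments.
from typing import Dict, List


def expected_precision_at_k(
    ranking: List[str],                # document ids, best first
    judgments: Dict[str, int],         # doc id -> 0/1 human judgment (sparse)
    relevance_prob: Dict[str, float],  # doc id -> estimated P(relevant)
    k: int = 10,
    default_prob: float = 0.0,         # prior for documents with no estimate
) -> float:
    """Expected precision@k under independent per-document relevance estimates."""
    expected_relevant = 0.0
    for doc_id in ranking[:k]:
        if doc_id in judgments:
            # Use the human judgment whenever one exists.
            expected_relevant += judgments[doc_id]
        else:
            # Otherwise fall back to the estimated relevance probability.
            expected_relevant += relevance_prob.get(doc_id, default_prob)
    return expected_relevant / k


if __name__ == "__main__":
    run = ["d3", "d7", "d1", "d9", "d2"]
    judged = {"d3": 1, "d1": 0}                    # only two judgments for this topic
    estimates = {"d7": 0.6, "d9": 0.2, "d2": 0.4}  # hypothetical relevance estimates
    print(expected_precision_at_k(run, judged, estimates, k=5))  # 0.44
```

In this toy example the run is scored even though only two of its documents were judged; the remaining contribution comes from the relevance estimates, which is the general flavor of augmenting a small judgment set at no additional assessor cost.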