Several important information retrieval tasks, including those in medicine, law, and patent review, have an authoritative standard of relevance and are concerned with retrieval completeness. When retrieval effectiveness is evaluated in these domains, assessors make errors in applying the standard of relevance, and the impact of these errors, particularly on estimates of recall, is of crucial concern. Using data from the interactive task of the TREC Legal Track, this paper investigates how reliably the yield of relevant documents can be estimated from sampled assessments in the presence of assessor error, particularly where sampling is stratified based on the results of participating retrieval systems. We show that assessor error is in general a greater source of inaccuracy than sampling error. A process of appeal and adjudication, such as that used in the interactive task, is found to be effective at locating many assessment errors; but the process is expensive if complete, and biased if incomplete. An unbiased double-sampling method for resolving assessment error is proposed, and shown on representative data to be more efficient and accurate than appeal-based adjudication.
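To make the two estimation ideas in the abstract concrete, the following is a minimal Python sketch of (a) estimating the total yield of relevant documents from stratified sampled assessments and (b) adjusting that estimate with a double-sampling (two-phase) correction in which a subsample of first-phase judgments is re-assessed authoritatively. The Stratum structure, the function names, the use of a single pair of confirmation/overturn rates across all strata, and the example numbers are illustrative assumptions, not the paper's actual estimator or the TREC Legal Track data.

```python
# Illustrative sketch only: names, structures, and numbers are assumptions,
# not the paper's estimator or TREC Legal Track data.
from dataclasses import dataclass


@dataclass
class Stratum:
    size: int             # N_h: number of documents in the stratum
    sampled: int          # n_h: documents drawn and assessed from the stratum
    judged_relevant: int  # first-phase assessments marked relevant


def stratified_yield(strata: list[Stratum]) -> float:
    """First-phase yield estimate: sum over strata of N_h * (proportion judged relevant)."""
    return sum(s.size * s.judged_relevant / s.sampled for s in strata)


def double_sampled_yield(strata: list[Stratum],
                         confirm_rate: float,
                         overturn_rate: float) -> float:
    """Second-phase correction for assessor error.

    confirm_rate: fraction of re-assessed 'relevant' judgments that an
        authoritative assessor upholds.
    overturn_rate: fraction of re-assessed 'nonrelevant' judgments that the
        authoritative assessor overturns to relevant.
    Both rates are assumed to come from a random second-phase subsample of
    the first-phase assessments; applying one pair of rates to every
    stratum is a simplification.
    """
    total = 0.0
    for s in strata:
        p_rel = s.judged_relevant / s.sampled
        corrected = p_rel * confirm_rate + (1.0 - p_rel) * overturn_rate
        total += s.size * corrected
    return total


if __name__ == "__main__":
    # Toy strata, e.g. grouped by how many participating systems retrieved a document.
    strata = [
        Stratum(size=2_000, sampled=200, judged_relevant=120),
        Stratum(size=20_000, sampled=400, judged_relevant=80),
        Stratum(size=500_000, sampled=500, judged_relevant=10),
    ]
    print("first-phase yield:", round(stratified_yield(strata)))
    # Suppose re-assessment upheld 85% of 'relevant' and overturned 2% of
    # 'nonrelevant' first-phase judgments (illustrative numbers only).
    print("corrected yield:  ", round(double_sampled_yield(strata, 0.85, 0.02)))
```

Because the second-phase re-assessment subsample is drawn at random from the first-phase judgments rather than only from appealed documents, this style of correction avoids the bias that an incomplete appeal process introduces, at the cost of an extra round of authoritative assessment.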