The Initiative for the Evaluation of XML retrieval (INEX) provides a TREC-like platform for evaluating content-oriented XML retrieval systems. Since 2007, INEX has used a set of precision-recall-based metrics for its ad hoc tasks. The authors investigate the reliability and robustness of these focused retrieval measures, and of the INEX pooling method. They explore four specific questions: How reliable are the metrics when assessments are incomplete, or when query sets are small? What is the minimum pool or query-set size that can be used to evaluate systems reliably? Can the INEX collections be used to fairly evaluate “new” systems that did not participate in the pooling process? And, for a fixed amount of assessment effort, is that effort better spent judging a few queries thoroughly, or judging many queries relatively superficially? The authors' findings validate properties of precision-recall-based metrics observed in document retrieval settings: early-precision measures are more error-prone and less stable under incomplete judgments and small topic-set sizes. They also find that system rankings remain largely unaffected even when assessment effort is substantially (but systematically) reduced, and confirm that the INEX collections remain usable for evaluating nonparticipating systems. Finally, they observe that, for a fixed amount of effort, judging shallow pools for many queries is better than judging deep pools for a smaller set of queries; however, when judging only a random sample of each pool, it is better to fully judge fewer topics than to partially judge many. This result confirms the effectiveness of pooling methods. © 2011 Wiley Periodicals, Inc.
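As a rough illustration of how such robustness comparisons are commonly quantified (this is not the paper's own code), the sketch below computes Kendall's tau between the system ranking obtained from a full judgment pool and the ranking obtained from a reduced pool. The system names and scores are hypothetical, and SciPy is assumed to be available.

```python
# Illustrative sketch: compare system rankings under full vs. reduced judgment pools.
# All scores and system names below are made up for demonstration purposes.
from scipy.stats import kendalltau

# Hypothetical per-system effectiveness scores computed with the full pool
# and with a reduced (e.g., shallower or downsampled) pool.
full_pool_scores = {"sysA": 0.41, "sysB": 0.37, "sysC": 0.29, "sysD": 0.22}
reduced_pool_scores = {"sysA": 0.38, "sysB": 0.39, "sysC": 0.27, "sysD": 0.20}

def rank_positions(scores):
    """Map each system to its rank position (0 = best) under the given scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: position for position, system in enumerate(ordered)}

full_rank = rank_positions(full_pool_scores)
reduced_rank = rank_positions(reduced_pool_scores)

systems = sorted(full_pool_scores)  # fixed order so the two rank lists are paired
tau, _ = kendalltau([full_rank[s] for s in systems],
                    [reduced_rank[s] for s in systems])
print(f"Kendall tau between full-pool and reduced-pool rankings: {tau:.3f}")
```

A tau close to 1.0 indicates that the reduced-effort evaluation reproduces the original system ordering, which is the sense in which rankings are said to remain "largely unaffected" when assessment effort is reduced.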