Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track, along with deeper analysis: investigating the tradeoff between the number of queries and the number of judgments per query shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective than, and as reliable as, evaluation over fewer queries with more judgments. Total assessor effort can be reduced by 95% with no appreciable increase in evaluation errors.
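
The central claim, that a fixed judging budget goes further when spread over many queries, can be illustrated with a small simulation. The sketch below is not the track's methodology; the constants, helper names, and noise model are all illustrative assumptions. It posits that the true per-query score difference between two systems varies across queries, that estimation noise from incomplete judging shrinks roughly with the square root of judgment depth, and then measures how often each budget allocation ranks the two systems in the wrong order.

    # Minimal sketch (illustrative assumptions throughout, not the
    # track's actual selection or estimation algorithms).
    import numpy as np

    rng = np.random.default_rng(0)

    TRUE_DELTA = 0.02   # assumed true mean score difference, system A - B
    QUERY_STD = 0.15    # assumed per-query variability of that difference
    BUDGET = 20_000     # assumed total relevance judgments available

    def measurement_std(depth):
        # Assumption: noise from incomplete judging falls off
        # roughly like 1 / sqrt(judgments per query).
        return 0.25 / np.sqrt(depth)

    def sign_error_rate(num_queries, depth, trials=2000):
        """Fraction of simulated evaluations that rank the systems in
        the wrong order (observed mean difference <= 0 despite a
        positive true difference)."""
        per_query = rng.normal(TRUE_DELTA, QUERY_STD, (trials, num_queries))
        noise = rng.normal(0.0, measurement_std(depth), (trials, num_queries))
        observed = (per_query + noise).mean(axis=1)
        return float(np.mean(observed <= 0))

    # Fixed total budget: many shallowly judged queries vs. few deep ones.
    for num_queries in (50, 200, 800, 1600):
        depth = BUDGET // num_queries
        print(f"{num_queries:>5} queries x {depth:>4} judgments/query: "
              f"sign-error rate = {sign_error_rate(num_queries, depth):.3f}")

Under these assumptions, spreading the same budget over more queries drives the sign-error rate down, because averaging over more queries suppresses per-query variability faster than shallower judging inflates measurement noise; consistent with the abstract, the benefit holds only up to a point, after which per-query noise dominates.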