Efficient construction of large test collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
How reliable are the results of large-scale information retrieval experiments?
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Ranking retrieval systems without relevance judgments
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
The Philosophy of Information Retrieval Evaluation
CLEF '01 Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems
Retrieval evaluation with incomplete information
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Forming test collections with no system pooling
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Challenges in running a commercial search engine
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval system evaluation: effort, sensitivity, and reliability
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 14th ACM international conference on Information and knowledge management
Robust test collections for retrieval evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Deconstructing nuggets: the stability and reliability of complex question answering evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Strategic system comparisons via targeted relevance judgments
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
A new approach for evaluating query expansion: query-document term mismatch
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Performance prediction using spatial autocorrelation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Power and bias of subset pooling strategies
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Active exploration for learning rankings from clickthrough data
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Repeatable evaluation of search services in dynamic environments
ACM Transactions on Information Systems (TOIS)
Inferring document relevance from incomplete information
Proceedings of the 16th ACM conference on Information and knowledge management
Hypothesis testing with incomplete relevance judgments
Proceedings of the 16th ACM conference on Information and knowledge management
Semiautomatic evaluation of retrieval systems using document similarities
Proceedings of the 16th ACM conference on Information and knowledge management
Evaluating epistemic uncertainty under incomplete assessments
Information Processing and Management
How robust are multilingual information retrieval systems?
Proceedings of the 2008 ACM symposium on Applied computing
A simple and efficient sampling method for estimating AP and NDCG
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation over thousands of queries
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision when judgments are incomplete
Knowledge and Information Systems
Rank-biased precision for measurement of retrieval effectiveness
ACM Transactions on Information Systems (TOIS)
How does clickthrough data reflect retrieval quality?
Proceedings of the 17th ACM conference on Information and knowledge management
Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Investigating Learning Approaches for Blog Post Opinion Retrieval
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Score adjustment for correction of pooling bias
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Document selection methodologies for efficient and effective learning-to-rank
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
A few good topics: Experiments in topic set reduction for retrieval evaluation
ACM Transactions on Information Systems (TOIS)
Expected reciprocal rank for graded relevance
Proceedings of the 18th ACM conference on Information and knowledge management
A retrieval evaluation methodology for incomplete relevance assessments
ECIR '07 Proceedings of the 29th European conference on IR research
Annotations and digital libraries: designing adequate test-beds
ICADL '07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Here or there: preference judgments for relevance
ECIR '08 Proceedings of the 30th European conference on Advances in information retrieval
Active learning for ranking through expected loss optimization
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Learning more powerful test statistics for click-based retrieval evaluation
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
The effect of assessor error on IR system evaluation
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Reusable test collections through experimental design
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Positional relevance model for pseudo-relevance feedback
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Retrieval system evaluation: automatic evaluation versus incomplete judgments
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Crowdsourcing document relevance assessment with Mechanical Turk
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
Crowdsourcing for search evaluation
ACM SIGIR Forum
Research methodology in studies of assessor effort for information retrieval evaluation
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Using clustering to improve retrieval evaluation without relevance judgments
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Diagnostic Evaluation of Information Retrieval Models
ACM Transactions on Information Systems (TOIS)
Evaluating new search engine configurations with pre-existing judgments and clicks
Proceedings of the 20th international conference on World wide web
Crowdsourcing for search and data mining
ACM SIGIR Forum
Relevant knowledge helps in choosing right teacher: active query selection for ranking adaptation
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Efficiently collecting relevance information from clickthroughs for web retrieval system evaluation
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Pseudo test collections for learning web search ranking functions
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
The effects of choice in routing relevance judgments
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Selecting a subset of queries for acquisition of further relevance judgements
ICTIR '11 Proceedings of the Third international conference on Advances in information retrieval theory
Prioritizing relevance judgments to improve the construction of IR test collections
Proceedings of the 20th ACM international conference on Information and knowledge management
Effectiveness beyond the first crawl tier
Proceedings of the 20th ACM international conference on Information and knowledge management
A nugget-based test collection construction paradigm
Proceedings of the 20th ACM international conference on Information and knowledge management
Evaluating large-scale distributed vertical search
Proceedings of the 9th workshop on Large-scale and distributed information retrieval
Crowdsourcing for information retrieval
ACM SIGIR Forum
Large-scale validation and analysis of interleaved search evaluation
ACM Transactions on Information Systems (TOIS)
IR system evaluation using nugget-based test collections
Proceedings of the fifth ACM international conference on Web search and data mining
Towards minimal test collections for evaluation of audio music similarity and retrieval
Proceedings of the 21st international conference companion on World Wide Web
Automated functional testing of online search services
Software Testing, Verification & Reliability
Quality through flow and immersion: gamifying crowdsourced relevance assessments
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
An uncertainty-aware query selection model for evaluation of IR systems
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Active query selection for learning rankers
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Retrieval evaluation on focused tasks
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Constructing test collections by inferring document relevance via extracted relevant information
Proceedings of the 21st ACM international conference on Information and knowledge management
Active evaluation of ranking functions based on graded relevance
ECML PKDD '12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II
Crowdsourcing for information retrieval: introduction to the special issue
Information Retrieval
Efficient ad-hoc search for personalized PageRank
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Pseudo test collections for training and tuning microblog rankers
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Building a web test collection using social media
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Is relevance hard work?: evaluating the effort of making relevant assessments
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
On Using Fewer Topics in Information Retrieval Evaluations
Proceedings of the 2013 Conference on the Theory of Information Retrieval
Active evaluation of ranking functions based on graded relevance
Machine Learning
Learning to rank query suggestions for adhoc and diversity search
Information Retrieval
A new statistical strategy for pooling: ELI
Information Processing Letters
Choices in batch information retrieval evaluation
Proceedings of the 18th Australasian Document Computing Symposium
Active evaluation of ranking functions based on graded relevance (extended abstract)
IJCAI '13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
The neglected user in music information retrieval research
Journal of Intelligent Information Systems
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough to evaluate real-world retrieval implementations is at best inefficient and at worst infeasible. In this work we link evaluation with test collection construction to understand the minimal judging effort required to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads both to a natural algorithm for selecting which documents to judge and to an estimate of the degree of confidence in an evaluation, obtained by defining a distribution over possible document judgments. A study with annotators shows that a small group of researchers can use this method to rank a set of systems with 95% confidence in under three hours.
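The evaluation loop the abstract describes is: judge the most informative unjudged document next, then stop once the probability that one system outperforms the other is high enough. As a rough illustration only, the following Python sketch implements that loop; it is not the authors' algorithm. The Bernoulli prior p_rel over unjudged documents, the brute-force Monte Carlo confidence estimate, and all function names are assumptions made for this example.

    import numpy as np

    def average_precision(run, rel):
        """AP of one ranked list under a {doc_id: 0/1} relevance assignment."""
        num_rel = sum(rel.values())
        if num_rel == 0:
            return 0.0
        hits, score = 0, 0.0
        for rank, doc in enumerate(run, start=1):
            if rel.get(doc, 0):
                hits += 1
                score += hits / rank
        return score / num_rel

    def confidence(run_a, run_b, judgments, p_rel=0.5, n_samples=1000, seed=0):
        """Monte Carlo estimate of P(AP_A > AP_B), assuming each unjudged
        document is relevant independently with probability p_rel."""
        rng = np.random.default_rng(seed)
        pool = list(dict.fromkeys(run_a + run_b))
        unjudged = [d for d in pool if d not in judgments]
        wins = 0
        for _ in range(n_samples):
            sample = dict(judgments)
            draws = rng.random(len(unjudged)) < p_rel   # sampled relevance
            sample.update(zip(unjudged, (int(b) for b in draws)))
            if average_precision(run_a, sample) > average_precision(run_b, sample):
                wins += 1
        return wins / n_samples

    def next_doc_to_judge(run_a, run_b, judgments, **kw):
        """Greedy heuristic: pick the unjudged document whose judgment,
        over its two possible outcomes, would shift the confidence most."""
        candidates = [d for d in dict.fromkeys(run_a + run_b)
                      if d not in judgments]
        if not candidates:
            return None
        base = confidence(run_a, run_b, judgments, **kw)
        def expected_shift(d):
            c_rel = confidence(run_a, run_b, {**judgments, d: 1}, **kw)
            c_non = confidence(run_a, run_b, {**judgments, d: 0}, **kw)
            return abs(c_rel - base) + abs(c_non - base)
        return max(candidates, key=expected_shift)

    # Example: judge documents until 95% confident either way.
    run_a = ["d1", "d2", "d3", "d4"]
    run_b = ["d2", "d4", "d1", "d5"]
    judgments = {}
    c = confidence(run_a, run_b, judgments)
    while max(c, 1 - c) < 0.95 and len(judgments) < len(set(run_a + run_b)):
        doc = next_doc_to_judge(run_a, run_b, judgments)
        judgments[doc] = int(input(f"Is {doc} relevant? (0/1) "))  # stand-in assessor
        c = confidence(run_a, run_b, judgments)

In practice the brute-force re-sampling would be replaced by an incremental or closed-form computation over the distribution of judgments, since re-estimating confidence for every candidate document is quadratic in the size of the judging pool.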