Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track, along with deeper analysis: investigating the tradeoff between the number of queries and the number of judgments per query shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective than, and as reliable as, evaluation over fewer queries with more judgments. Total assessor effort can be reduced by 95% with no appreciable increase in evaluation errors.
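
The central claim, that a fixed judging budget goes further when spread over many queries, can be illustrated with a small simulation. The sketch below is not the track's methodology; the constants, helper names, and noise model are all illustrative assumptions. It posits that the true per-query score difference between two systems varies across queries, that estimation noise from incomplete judging shrinks roughly with the square root of judgment depth, and then measures how often each budget allocation ranks the two systems in the wrong order.

    # Minimal sketch (illustrative assumptions throughout, not the
    # track's actual selection or estimation algorithms).
    import numpy as np

    rng = np.random.default_rng(0)

    TRUE_DELTA = 0.02   # assumed true mean score difference, system A - B
    QUERY_STD = 0.15    # assumed per-query variability of that difference
    BUDGET = 20_000     # assumed total relevance judgments available

    def measurement_std(depth):
        # Assumption: noise from incomplete judging falls off
        # roughly like 1 / sqrt(judgments per query).
        return 0.25 / np.sqrt(depth)

    def sign_error_rate(num_queries, depth, trials=2000):
        """Fraction of simulated evaluations that rank the systems in
        the wrong order (observed mean difference <= 0 despite a
        positive true difference)."""
        per_query = rng.normal(TRUE_DELTA, QUERY_STD, (trials, num_queries))
        noise = rng.normal(0.0, measurement_std(depth), (trials, num_queries))
        observed = (per_query + noise).mean(axis=1)
        return float(np.mean(observed <= 0))

    # Fixed total budget: many shallowly judged queries vs. few deep ones.
    for num_queries in (50, 200, 800, 1600):
        depth = BUDGET // num_queries
        print(f"{num_queries:>5} queries x {depth:>4} judgments/query: "
              f"sign-error rate = {sign_error_rate(num_queries, depth):.3f}")

Under these assumptions, spreading the same budget over more queries drives the sign-error rate down, because averaging over more queries suppresses per-query variability faster than shallower judging inflates measurement noise; consistent with the abstract, the benefit holds only up to a point, after which per-query noise dominates.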