A statistical method for system evaluation using incomplete judgments

  • Authors:
  • Javed A. Aslam, Virgil Pavlu, Emine Yilmaz

  • Affiliations:
  • Northeastern University, Boston, MA (all authors)

  • Venue:
  • SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year:
  • 2006

Abstract

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves.

Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance depends on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low-variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using as few relevance judgments as 4% of the typical TREC-style judgment pool.
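The core idea can be illustrated with a small sketch: judge only a random sample of pooled documents, drawn from a chosen sampling distribution, and correct for that distribution via importance weighting so that the resulting estimate of a standard measure remains unbiased. The Python sketch below shows this for precision at rank k under binary relevance; the function names, the with-replacement sampling, and the choice of precision@k are illustrative assumptions, not the paper's exact estimator (which targets measures such as average precision and derives a specific low-variance sampling distribution).

```python
import random

def sample_judgments(pool, p, num_samples, judge):
    """Draw documents (with replacement) from the judgment pool
    according to sampling distribution p, judging each draw.
    `judge(d)` returns a binary relevance in {0, 1}; this helper
    is a hypothetical stand-in for obtaining a human judgment."""
    docs = list(pool)
    weights = [p[d] for d in docs]  # requires p[d] > 0 for every pooled doc
    draws = random.choices(docs, weights=weights, k=num_samples)
    return [(d, judge(d)) for d in draws]

def estimate_precision_at_k(ranked_list, k, samples, p):
    """Importance-sampling estimate of precision@k.
    Since E_{d~p}[ rel(d) * 1[d in top k] / p(d) ] equals the sum of
    relevance over the top k, averaging the weighted samples and
    dividing by k yields an unbiased estimate of precision@k."""
    top_k = set(ranked_list[:k])
    total = sum(rel / p[d] for d, rel in samples if d in top_k)
    return total / (len(samples) * k)
```

With a uniform p this reduces to plain Monte Carlo sampling over the pool; the variance-reduction idea in the paper corresponds to choosing a nonuniform p, concentrated on documents ranked highly by the contributing runs, which preserves unbiasedness while reducing the number of judgments needed for accurate estimates.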