The effect of topic set size on retrieval experiment error

Authors:
Ellen M. Voorhees; Chris Buckley
Affiliations:
National Institute of Standards and Technology, Gaithersburg, MD; Sabir Research Inc, Gaithersburg, MD
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Abstract

Retrieval mechanisms are frequently compared by computing the respective average scores for some effectiveness metric across a common set of information needs or topics, with researchers concluding one method is superior based on those averages. Since comparative retrieval system behavior is known to be highly variable across topics, good experimental design requires that a "sufficient" number of topics be used in the test. This paper uses TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores. The error rates quantify the likelihood that a different set of topics of the same size would lead to a different conclusion. We directly compute error rates for topic sets up to size 25, and extrapolate those rates for larger topic set sizes. The error rates found are larger than anticipated, indicating researchers need to take care when concluding one method is better than another, especially if few topics are used.
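
The idea of an error rate tied to topic set size can be illustrated with a minimal sketch. It is not the authors' exact procedure: it assumes per-topic scores for two systems are available as plain lists, repeatedly draws two disjoint topic sets of the same size, and counts how often the two sets disagree about which system has the higher average. The function name `swap_rate`, its parameters, and the synthetic scores in the demo are all hypothetical.

```python
import random
from statistics import mean


def swap_rate(scores_a, scores_b, set_size, trials=1000, seed=0):
    """Estimate how often two disjoint topic sets of `set_size` topics
    disagree about which system has the higher mean score.

    scores_a[i] and scores_b[i] are the two systems' scores on topic i.
    Illustrative sketch only, not the procedure from the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need a score for every topic")
    n_topics = len(scores_a)
    if set_size < 1 or 2 * set_size > n_topics:
        raise ValueError("need at least 2 * set_size topics")

    rng = random.Random(seed)
    swaps = 0
    counted = 0
    for _ in range(trials):
        # Draw 2 * set_size distinct topics and split them into two disjoint sets.
        drawn = rng.sample(range(n_topics), 2 * set_size)
        first, second = drawn[:set_size], drawn[set_size:]
        diff_first = mean(scores_a[t] for t in first) - mean(scores_b[t] for t in first)
        diff_second = mean(scores_a[t] for t in second) - mean(scores_b[t] for t in second)
        if diff_first == 0 or diff_second == 0:
            continue  # a tie gives no conclusion to contradict
        counted += 1
        if (diff_first > 0) != (diff_second > 0):
            swaps += 1  # the two topic sets lead to opposite conclusions
    return swaps / counted if counted else 0.0


if __name__ == "__main__":
    # Synthetic per-topic scores (an assumption, not TREC data):
    # system A is slightly better on average, with large per-topic variation.
    rng = random.Random(42)
    system_b = [rng.uniform(0.0, 0.6) for _ in range(50)]
    system_a = [min(1.0, max(0.0, s + rng.gauss(0.02, 0.15))) for s in system_b]
    for size in (5, 10, 25):
        print(size, round(swap_rate(system_a, system_b, size), 3))
```

In this sketch a high value for a given `set_size` means that a different topic set of the same size would often reverse the conclusion. Grouping the trials by the observed difference in average scores, as the abstract describes, would be a natural extension.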