Evaluating evaluation measure stability

Authors:
Chris Buckley; Ellen M. Voorhees
Affiliations:
Sabir Research Inc., Gaithersburg, MD;National Institute of Standards and Technology, Gaithersburg, Maryland
Venue:
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2000

Abstract

This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods before concluding that the two methods are actually different.
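
For readers unfamiliar with the two measures the abstract compares, the following is a minimal sketch of their standard textbook definitions for a single query. It is not taken from the paper; the function names, the document identifiers, and the use of a set of relevant document ids are assumptions made for illustration only.

```python
from typing import Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    Illustrative helper (hypothetical name). Precision at 10 or 30
    documents corresponds to k=10 or k=30.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


# Toy data, invented for this sketch: four retrieved documents,
# three relevant documents, one of which (d5) is never retrieved.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}

p_at_2 = precision_at_k(ranking, relevant, k=2)
ap = average_precision(ranking, relevant)
```

Precision at a fixed cutoff looks only at the top k documents, while Average Precision takes into account the rank of every relevant document retrieved for the query.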