How reliable are the results of large-scale information retrieval experiments?

Authors:
Justin Zobel
Affiliations:
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia
Venue:
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
1998

Abstract

Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
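
To make the recall claim concrete, the following is a minimal sketch of conventional depth-k pooling as generally used in TREC-style evaluations. It is not the new pooling strategy proposed in the paper, which the abstract does not specify. The run names, document identifiers, the "true" relevant set, and the helper names `depth_k_pool` and `recall` are all hypothetical and chosen only for illustration. The sketch shows how recall measured against pooled judgments can exceed recall against the full set of relevant documents, because relevant documents outside the pool are never judged.

```python
from typing import Dict, List, Set


def depth_k_pool(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Union of the top-k documents from every run (the set sent to assessors)."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def recall(ranking: List[str], relevant: Set[str]) -> float:
    """Fraction of the given relevant set that appears in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


# Hypothetical toy data: two runs and an invented "true" relevant set.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d6", "d7"],
}
true_relevant = {"d1", "d2", "d4", "d7"}

# Only pooled documents are judged; everything else is assumed non-relevant.
pool = depth_k_pool(runs, k=2)
judged_relevant = pool & true_relevant  # d4 and d7 fall outside the pool

measured = recall(runs["run_a"], judged_relevant)  # recall against pooled judgments
actual = recall(runs["run_a"], true_relevant)      # recall against all relevant documents
print(f"measured recall = {measured:.2f}, actual recall = {actual:.2f}")
```

In this toy case the measured recall is higher than the actual recall, since two relevant documents were never assessed. Both runs are scored against the same judged set, which is consistent with the abstract's finding that relative performance can remain reliable even when absolute recall is overestimated.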