Information retrieval system evaluation: effort, sensitivity, and reliability

  • Authors:
  • Mark Sanderson; Justin Zobel

  • Affiliations:
  • University of Sheffield, Sheffield, UK; RMIT, Melbourne, Australia

  • Venue:
  • Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2005

Abstract

The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests overestimated the error of such tests. We also reconsider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
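
The following is a minimal sketch, not taken from the paper, of the kind of paired per-topic comparison the abstract describes: two hypothetical IR systems are scored on the same topics, their mean scores and percentage difference are computed, and the paired t-test, Wilcoxon signed-rank test, and sign test are applied to the per-topic differences. The system names, topic count, and score distributions are illustrative assumptions.

```python
# Sketch (assumed data, not from the paper): comparing two hypothetical IR
# systems on the same topic set with the three significance tests named in
# the abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-topic average-precision scores for two systems, A and B,
# over 50 topics; B is simulated to be slightly better on average.
n_topics = 50
system_a = rng.uniform(0.1, 0.6, n_topics)
system_b = np.clip(system_a + rng.normal(0.03, 0.08, n_topics), 0.0, 1.0)

# Paired t-test on the per-topic score differences.
t_stat, t_p = stats.ttest_rel(system_b, system_a)

# Wilcoxon signed-rank test, a non-parametric alternative.
w_stat, w_p = stats.wilcoxon(system_b, system_a)

# Sign test: count topics where B beats A and test against a fair coin,
# ignoring ties.
wins = int(np.sum(system_b > system_a))
ties = int(np.sum(system_b == system_a))
sign_p = stats.binomtest(wins, n_topics - ties, 0.5).pvalue

pct_diff = 100.0 * (system_b.mean() - system_a.mean()) / system_a.mean()
print(f"mean AP: A={system_a.mean():.3f}  B={system_b.mean():.3f}  ({pct_diff:+.1f}%)")
print(f"paired t-test:  p={t_p:.4f}")
print(f"Wilcoxon test:  p={w_p:.4f}")
print(f"sign test:      p={sign_p:.4f}")
```

As the abstract notes, a large percentage difference in mean scores alone is a weak indicator of a real improvement; the paired tests use the per-topic variation, which is why the number of topics in the test collection matters so much to reliability.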