A comparison of statistical significance tests for information retrieval evaluation

  • Authors:
  • Mark D. Smucker; James Allan; Ben Carterette

  • Affiliations:
  • University of Massachusetts Amherst, Amherst, MA (all authors)

  • Venue:
  • Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07)
  • Year:
  • 2007

Abstract

Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR, but these tests have seen little use. For each of these five tests, we took the ad hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t-tests. Both the Wilcoxon and sign tests have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test, and their use should be discontinued for measuring the significance of a difference between means.
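The paired randomization test the abstract refers to can be sketched as follows. This is an illustrative implementation, not the authors' code: under the null hypothesis that two runs are equivalent, each topic's label assignment is exchangeable, so we repeatedly flip the sign of each per-topic score difference and count how often the permuted mean difference is at least as extreme as the observed one. The function name and example scores are hypothetical.

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (permutation) test.

    scores_a, scores_b: per-topic scores (e.g., average precision)
    for two retrieval runs over the same topics.
    Returns an estimated p-value for the difference in means.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(trials):
        # Randomly swap each pair's labels, i.e., flip each difference's sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / n) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical per-topic AP scores for two runs over 20 topics.
run_a = [0.50 + 0.01 * i for i in range(20)]
run_b = [0.30 + 0.01 * i for i in range(20)]
p_value = randomization_test(run_a, run_b)
```

With enough trials this Monte Carlo estimate approaches the exact permutation p-value over all 2^n sign assignments, which is what makes the test attractive for small topic sets where parametric assumptions are doubtful.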