The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
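To make the comparison concrete, the following is a minimal sketch of how a paired significance test and a plain percentage difference can be computed from per-topic scores for two systems. It is not the procedure used in the paper. The score lists are made-up illustrative values, and the function names (`paired_t_statistic`, `sign_test_p_value`, `relative_difference`) are hypothetical helpers introduced here. Only the Python standard library is used, so the t-test part stops at the t statistic and its degrees of freedom; turning that into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired per-topic scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def sign_test_p_value(a, b):
    """Exact two-sided sign test on paired scores; tied topics are dropped."""
    wins = sum(1 for x, y in zip(a, b) if x > y)
    losses = sum(1 for x, y in zip(a, b) if x < y)
    n = wins + losses
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def relative_difference(a, b):
    """Relative improvement of the mean score of a over the mean score of b."""
    return (statistics.mean(a) - statistics.mean(b)) / statistics.mean(b)


# Hypothetical per-topic average precision scores (illustrative only).
system_a = [0.42, 0.31, 0.55, 0.12, 0.68, 0.25, 0.47, 0.39, 0.51, 0.20]
system_b = [0.38, 0.29, 0.50, 0.15, 0.60, 0.22, 0.41, 0.37, 0.49, 0.18]

t_stat, dof = paired_t_statistic(system_a, system_b)
print(f"relative difference in means: {relative_difference(system_a, system_b):.1%}")
print(f"paired t statistic: {t_stat:.2f} with {dof} degrees of freedom")
print(f"sign test p-value: {sign_test_p_value(system_a, system_b):.4f}")
```

The percentage difference uses only the two means, whereas the paired tests also use how consistently one system beats the other across topics, which is the information the abstract argues should not be discarded.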