Using statistical testing in the evaluation of retrieval experiments
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical inference in retrieval effectiveness evaluation
Information Processing and Management: an International Journal
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
The effect of topic set size on retrieval experiment error
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Cumulated gain-based evaluation of IR techniques
ACM Transactions on Information Systems (TOIS)
Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems
Information Processing and Management: an International Journal
Information retrieval system evaluation: effort, sensitivity, and reliability
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Revisiting the effect of topic set size on retrieval error
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
On the reliability of information retrieval metrics based on graded relevance
Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia
On effectiveness measures and relevance functions in ranking INEX systems
AIRS'05 Proceedings of the Second Asia conference on Information Retrieval Technology
Give me just one highly relevant document: P-measure
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
On GMAP: and other transformations
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
On the reliability of factoid question answering evaluation
ACM Transactions on Asian Language Information Processing (TALIP)
Repeatable evaluation of search services in dynamic environments
ACM Transactions on Information Systems (TOIS)
A comparison of statistical significance tests for information retrieval evaluation
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Score standardization for inter-collection comparison of retrieval systems
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Comparing metrics across TREC and NTCIR: the robustness to pool depth bias
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
On test collections for adaptive information retrieval
Information Processing and Management: an International Journal
Statistical power in retrieval experimentation
Proceedings of the 17th ACM conference on Information and knowledge management
Comparing metrics across TREC and NTCIR: the robustness to system bias
Proceedings of the 17th ACM conference on Information and knowledge management
On rank correlation and the distance between rankings
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Visualizing the problems with the INEX topics
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
A few good topics: Experiments in topic set reduction for retrieval evaluation
ACM Transactions on Information Systems (TOIS)
Empirical justification of the gain and discount function for nDCG
Proceedings of the 18th ACM conference on Information and knowledge management
Extending average precision to graded relevance judgments
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A comparative analysis of cascade measures for novelty and diversity
Proceedings of the fourth ACM international conference on Web search and data mining
Using graded-relevance metrics for evaluating community QA answer selection
Proceedings of the fourth ACM international conference on Web search and data mining
On the informativeness of cascade and intent-aware effectiveness measures
Proceedings of the 20th international conference on World wide web
A simple measure to assess non-response
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Evaluating diversified search results using per-intent graded relevance
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Using the euclidean distance for retrieval evaluation
BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Bootstrap-Based comparisons of IR metrics for finding one relevant document
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Measuring the variability in effectiveness of a retrieval system
IRFC'10 Proceedings of the First international Information Retrieval Facility conference on Advances in Multidisciplinary Retrieval
Evaluation with informational and navigational intents
Proceedings of the 21st international conference on World Wide Web
Time-based calibration of effectiveness measures
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
On per-topic variance in IR evaluation
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
A comprehensive analysis of parameter settings for novelty-biased cumulative gain
Proceedings of the 21st ACM international conference on Information and knowledge management
Evaluating question answering validation as a classification problem
Language Resources and Evaluation
Optimizing nDCG gains by minimizing effect of label inconsistency
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Summaries, ranked retrieval and sessions: a unified framework for information access evaluation
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Click model-based information retrieval metrics
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
A mutual information-based framework for the analysis of information retrieval systems
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Preference based evaluation measures for novelty and diversity
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Summary of the NTCIR-10 INTENT-2 task: subtopic mining and search result diversification
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
On the reliability and intuitiveness of aggregated search metrics
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Diversified search evaluation: lessons from the NTCIR-9 INTENT task
Information Retrieval
Increasing evaluation sensitivity to diversity
Information Retrieval
The water filling model and the cube test: multi-dimensional evaluation for professional search
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.
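As a rough illustration of the kind of test the abstract refers to, the sketch below implements a generic paired bootstrap hypothesis test over per-topic effectiveness scores. It is an assumption-laden simplification, not the paper's exact procedure: it resamples mean-centred per-topic differences with replacement and reports a two-sided achieved significance level (ASL). The function name, the resample count, and the example scores are all hypothetical.

```python
import random

def bootstrap_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired bootstrap significance test for two systems' mean
    effectiveness over the same topics (hypothetical sketch).

    Centres the per-topic differences so the null hypothesis of a
    zero mean difference holds, resamples them with replacement, and
    returns the two-sided achieved significance level (ASL).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences to mean zero to simulate the null hypothesis.
    centred = [d - observed for d in diffs]
    count = 0
    for _ in range(n_resamples):
        sample_mean = sum(centred[rng.randrange(n)] for _ in range(n)) / n
        # Count resamples at least as extreme as the observed difference.
        if abs(sample_mean) >= abs(observed):
            count += 1
    return count / n_resamples

# Hypothetical per-topic Average Precision scores for two runs:
run_a = [0.62, 0.71, 0.48, 0.80, 0.39]
run_b = [0.31, 0.64, 0.47, 0.22, 0.40]
asl = bootstrap_test(run_a, run_b)  # small ASL suggests a real difference
```

A small ASL (e.g. below 0.05) would lead one to reject the null hypothesis that the two systems are equally effective; the paper's sensitivity comparison builds on repeated tests of this kind across system pairs.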