Using statistical testing in the evaluation of retrieval experiments
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Information storage and retrieval
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents
SIGIR '01 Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
The effect of topic set size on retrieval experiment error
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Cumulated gain-based evaluation of IR techniques
ACM Transactions on Information Systems (TOIS)
SIGIR '03 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Measuring retrieval effectiveness: a new proposal and a first experimental validation
Journal of the American Society for Information Science and Technology
Retrieval evaluation with incomplete information
SIGIR '04 Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems
Information Processing and Management: an International Journal
Information retrieval system evaluation: effort, sensitivity, and reliability
SIGIR '05 Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation metrics based on the bootstrap
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Give me just one highly relevant document: P-measure
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
The reliability of metrics based on graded relevance
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
On effectiveness measures and relevance functions in ranking INEX systems
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
On the reliability of factoid question answering evaluation
ACM Transactions on Asian Language Information Processing (TALIP)
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Test theory for evaluating reliability of IR test collections
Information Processing and Management: an International Journal
Comparing metrics across TREC and NTCIR: the robustness to system bias
CIKM '08 Proceedings of the 17th ACM conference on Information and knowledge management
Building a framework for the probability ranking principle by a family of expected weighted rank
ACM Transactions on Information Systems (TOIS)
A few good topics: Experiments in topic set reduction for retrieval evaluation
ACM Transactions on Information Systems (TOIS)
Extracting learning concepts from educational texts in intelligent tutoring systems automatically
Expert Systems with Applications: An International Journal
A simple measure to assess non-response
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Pattern Recognition Letters
Bootstrap-based comparisons of IR metrics for finding one relevant document
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Measures for benchmarking semantic web service matchmaking correctness
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II
Journal of the American Society for Information Science and Technology
Evaluating question answering validation as a classification problem
Language Resources and Evaluation
On the measurement of test collection reliability
SIGIR '13 Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity, and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method, and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that Normalised Discounted Cumulative Gain at document cut-off l and its averaged variant are the best among the rank-based graded-relevance metrics, provided that l is large. If, on the other hand, one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and they are fairly robust to the choice of gain values.
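As a concrete illustration of the two graded-relevance metrics the abstract singles out, below is a minimal sketch, not the authors' code, of nDCG at document cut-off l (using the common log2(r+1) discount; the original Järvelin/Kekäläinen formulation discounts differently) and of Sakai's Q-measure. The gain values, function names, and the toy ranked list are illustrative assumptions.

import math

def ndcg_at_l(gains_by_rank, ideal_gains, l):
    # nDCG@l: DCG of the run's top l ranks divided by the DCG of the
    # ideal (gain-sorted) ranking. log2(r+1) is one common discount.
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:l], start=1))
    return dcg(gains_by_rank) / dcg(ideal_gains)

def q_measure(gains_by_rank, ideal_gains, R, beta=1.0):
    # Q-measure: the blended ratio (C(r) + beta*cg(r)) / (r + beta*cg*(r)),
    # averaged over the ranks of relevant retrieved documents and divided
    # by R, the total number of relevant documents for the topic.
    cg = cg_ideal = 0.0   # cumulative gain of the run / of the ideal ranking
    count_rel = 0         # C(r): relevant documents seen in the top r
    total = 0.0
    for r, g in enumerate(gains_by_rank, start=1):
        cg += g
        cg_ideal += ideal_gains[r - 1] if r <= len(ideal_gains) else 0.0
        if g > 0:  # a relevant document is retrieved at rank r
            count_rel += 1
            total += (count_rel + beta * cg) / (r + beta * cg_ideal)
    return total / R

# Toy topic: per-rank gains of a run (0 = non-relevant) and the ideal list.
run = [3, 0, 1, 2, 0]
ideal = [3, 2, 1, 0, 0]
print(ndcg_at_l(run, ideal, l=5), q_measure(run, ideal, R=3))

Note that with binary gains and beta = 0 the blended ratio reduces to precision at r, so Q-measure collapses to Average Precision; this is why the abstract can treat Q-measure as a recall-based, AP-like metric that additionally rewards graded relevance.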