A query-basis approach to parametrizing novelty-biased cumulative gain
ICTIR '11: Proceedings of the Third International Conference on Advances in Information Retrieval Theory
In the TREC Web track's diversity task, novelty-biased cumulative gain (α-nDCG) is one of the official measures for assessing the retrieval performance of IR systems. The measure is characterised by a parameter, α, whose effect has not been thoroughly investigated. We find that the common setting α=0.5 may prevent the measure from behaving as desired when evaluating result diversification: it excessively penalises systems that cover many intents, while rewarding those that redundantly cover only a few. This issue is crucial because it strongly influences how systems are scored at top ranks. We revisit our previously proposed threshold, suggesting that α be set on a per-query basis. We then study the intuitiveness of the measure by examining actual rankings from TREC 2009-2010 Web track submissions. Varying α according to our query-based threshold does not harm the discriminative power of α-nDCG and in fact improves its robustness. Experimental results show that our threshold for α makes the measure more intuitive than its common settings.
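To make the redundancy penalty concrete, here is a minimal sketch of the unnormalised α-DCG component underlying α-nDCG (the function name and the toy rankings are illustrative, not from the paper). Each document's gain for intent i is discounted by (1-α) raised to the number of documents already seen for that intent, so larger α penalises redundant coverage more heavily:

```python
import math

def alpha_dcg(ranking, alpha=0.5):
    """Unnormalised alpha-DCG.

    ranking: list of sets, one per rank position; each set holds the
    intent ids the document at that position is relevant to.
    """
    seen = {}      # intent id -> number of relevant docs already retrieved
    score = 0.0
    for k, intents in enumerate(ranking, start=1):
        # Gain decays geometrically with prior coverage of each intent.
        gain = sum((1 - alpha) ** seen.get(i, 0) for i in intents)
        score += gain / math.log2(k + 1)   # standard rank discount
        for i in intents:
            seen[i] = seen.get(i, 0) + 1
    return score

# Toy example: a ranking covering three distinct intents versus one
# that redundantly covers a single intent three times.
diverse = [{1}, {2}, {3}]
redundant = [{1}, {1}, {1}]
```

With α=0.5 the diverse ranking scores higher than the redundant one, while at α=0 the two are indistinguishable, which is why the choice of α matters for evaluating diversification.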