Blind Men and Elephants: Six Approaches to TREC Data. Information Retrieval.
On Collection Size and Retrieval Effectiveness. Information Retrieval.
On Ranking the Effectiveness of Searches. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Evaluating Evaluation Metrics Based on the Bootstrap. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Statistical Precision of Information Retrieval Evaluation. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Modeling the Score Distributions of Relevant and Non-relevant Documents. ICTIR '09: Proceedings of the 2nd International Conference on the Theory of Information Retrieval: Advances in Information Retrieval Theory.
Simulating Simple User Behavior for System Effectiveness Evaluation. Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
On Smoothing Average Precision. ECIR '12: Proceedings of the 34th European Conference on Advances in Information Retrieval.
On the Measurement of Test Collection Reliability. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Bias-Variance Decomposition of IR Evaluation. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Modelling Score Distributions Without Actual Scores. Proceedings of the 2013 Conference on the Theory of Information Retrieval.
Bias-Variance Analysis in Estimating True Query Model for Information Retrieval. Information Processing and Management: An International Journal.
We explore the notion, put forward by Cormack & Lynam and by Robertson, that a document collection used for Cranfield-style experiments should be considered a sample from some larger population of documents. In this view, any per-topic metric (such as average precision) is an estimate of that metric's true value for that topic over the full population, and therefore carries its own per-topic variance, estimation precision, or noise. As in the two papers mentioned, we explore this notion by simulating other samples from the same large population, and we investigate different ways of performing this simulation. One use of this analysis is to refine the notion of statistical significance of a difference between two systems (most such analyses treat each per-topic measurement as equally precise). We propose a mixed-effects-model method for measuring significance and compare it experimentally with the traditional t-test.
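The idea of treating the collection as a sample can be illustrated with a minimal sketch: repeatedly subsample the document collection, recompute average precision on each induced ranking, and take the spread of the resulting values as an estimate of the per-topic noise. The ranking, the relevant set, the 50% keep rate, and the simplified AP (normalized by relevant documents retrieved rather than judged) are all assumptions for illustration, not the paper's actual simulation design.

```python
# Hypothetical sketch: per-topic variance of average precision (AP)
# estimated by simulating smaller collections drawn from the same
# population via random document subsampling. All names, sizes, and
# the keep rate are illustrative assumptions.
import random
from statistics import mean, pstdev

def average_precision(ranking, relevant):
    """Simplified AP: mean precision at the ranks of retrieved
    relevant documents (0.0 if none are retrieved)."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0

def simulate_ap_samples(ranking, relevant, keep=0.5, n_samples=1000, seed=0):
    """Simulate collections sampled from a larger population by keeping
    each document with probability `keep`; return one AP per sample."""
    rng = random.Random(seed)
    aps = []
    for _ in range(n_samples):
        subranking = [d for d in ranking if rng.random() < keep]
        aps.append(average_precision(subranking, relevant))
    return aps

if __name__ == "__main__":
    # Toy topic: 20 retrieved documents, 4 of them relevant.
    ranking = [f"d{i}" for i in range(1, 21)]
    relevant = {"d1", "d4", "d9", "d15"}
    aps = simulate_ap_samples(ranking, relevant)
    print("AP on full collection:", round(average_precision(ranking, relevant), 4))
    print("mean AP over samples:", round(mean(aps), 4),
          "std (per-topic noise):", round(pstdev(aps), 4))
```

The nonzero standard deviation across simulated samples is the per-topic estimation noise the abstract refers to; a significance test that weights topics by this noise (e.g. via a mixed-effects model) can then down-weight topics whose AP estimates are least stable.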