How good is an IR test collection? A series of recent papers has addressed this question by empirically measuring how consistent performance comparisons remain across alternate subsets of a collection. In this paper we propose using Test Theory, a framework based on analysis of variance that is designed to assess the reliability of tests. With this method we can not only measure a test collection's reliability after the fact, but also estimate its reliability before it is built or used. We can likewise determine an optimal allocation of resources in advance, e.g. whether to invest in more judges or in more queries. The method, which is in widespread use in the field of educational testing, complements data-driven approaches to assessing test collections: whereas data-driven methods focus on test results, Test Theory focuses on test designs. It offers unique practical results, as well as insights into the variety and implications of alternative test designs.
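To make the idea concrete, the sketch below (ours, not the authors' implementation) illustrates the kind of analysis Test Theory performs. Assuming a fully crossed systems-by-queries design, it estimates variance components via two-way ANOVA and then runs a decision study that projects a generalizability coefficient for hypothetical query-set sizes; all function names and the simulated scores are illustrative.

import numpy as np

def variance_components(scores):
    # scores: systems x queries matrix of effectiveness scores (e.g., AP).
    # Two-way ANOVA without replication; the residual term absorbs the
    # system-by-query interaction.
    n_s, n_q = scores.shape
    grand = scores.mean()
    ss_s = n_q * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_q = n_s * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_e = ((scores - grand) ** 2).sum() - ss_s - ss_q
    ms_s = ss_s / (n_s - 1)
    ms_q = ss_q / (n_q - 1)
    ms_e = ss_e / ((n_s - 1) * (n_q - 1))
    var_e = ms_e                               # residual / interaction
    var_s = max((ms_s - ms_e) / n_q, 0.0)      # between-system variance
    var_q = max((ms_q - ms_e) / n_s, 0.0)      # between-query variance
    return var_s, var_q, var_e

def g_coefficient(var_s, var_e, n_queries):
    # Generalizability coefficient (E-rho^2) for relative decisions:
    # the reliability of system comparisons averaged over n_queries queries.
    return var_s / (var_s + var_e / n_queries)

# Simulated collection: 8 systems, 50 queries, additive effects plus noise.
rng = np.random.default_rng(0)
scores = (0.30
          + rng.normal(0.0, 0.05, size=(8, 1))    # system effects
          + rng.normal(0.0, 0.10, size=(1, 50))   # query (topic) effects
          + rng.normal(0.0, 0.08, size=(8, 50)))  # interaction / error
var_s, var_q, var_e = variance_components(scores)

# Decision study: project reliability for candidate query-set sizes
# before any new collection is built.
for n in (25, 50, 100, 200):
    print(f"{n:>3} queries -> E-rho^2 = {g_coefficient(var_s, var_e, n):.3f}")

The decision-study loop is what allows designs to be compared before resources are committed; a fuller treatment along the lines the paper describes would add a judge facet to the design, so that investing in more judges could be weighed against investing in more queries.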