Test theory for evaluating reliability of IR test collections

Authors:
David Bodoff
Affiliations:
University of Haifa, Graduate School of Management, Jacobs Building, 6th Floor, 31905 Haifa, Israel
Venue:
Information Processing and Management: an International Journal
Year:
2008

Citing 5
Cited 1

Stemming algorithms: a case study for detailed evaluation

Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Variations in relevance judgments and the measurement of retrieval effectiveness

Information Processing and Management: an International Journal
Blind Men and Elephants: Six Approaches to TREC data

Information Retrieval
Do TREC web collections look like the web?

ACM SIGIR Forum
On the reliability of information retrieval metrics based on graded relevance

Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia

On the measurement of test collection reliability

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Classical test theory offers theoretically derived reliability measures such as Cronbach's alpha, which can be applied to measure the reliability of a set of Information Retrieval test results. The theory also supports item analysis, which identifies queries that are hampering the test's reliability, and which may be candidates for refinement or removal. A generalization of Classical Test Theory, called Generalizability Theory, provides an even richer set of tools. It allows us to estimate the reliability of a test as a function of the number of queries, assessors (relevance judges), and other aspects of the test's design. One novel aspect of Generalizability Theory is that it allows this estimation of reliability even before the test collection exists, based purely on the numbers of queries and assessors that it will contain. These calculations can help test designers in advance, by allowing them to compare the reliability of test designs with various numbers of queries and relevance assessors, and to spend their limited budgets on a design that maximizes reliability. Empirical analysis shows that in cases for which our data is representative, having more queries is more helpful for reliability than having more assessors. It also suggests that reliability may be improved with a per-document performance measure, as opposed to a document-set based performance measure, where appropriate. The theory also clarifies the implicit debate in IR literature regarding the nature of error in relevance judgments.