On sample sizes for non-matched-pair IR experiments. Information Processing and Management.
The significance of the Cranfield tests on index languages. SIGIR '91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval.
The pragmatics of information retrieval experimentation, revisited. Information Processing and Management (special issue on evaluation issues in information retrieval).
Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science (special issue: evaluation of information retrieval systems).
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval.
Do batch and user evaluations give the same results? SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
Evaluating evaluation measure stability. SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
Blind Men and Elephants: Six Approaches to TREC data. Information Retrieval.
Why batch and user evaluations do not give the same results. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval.
The effect of topic set size on retrieval experiment error. SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval.
Automatic language and information processing: rethinking evaluation. Natural Language Engineering.
Information retrieval system evaluation: effort, sensitivity, and reliability. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.
Evaluating evaluation metrics based on the bootstrap. SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
Bias and the limits of pooling. SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval.
Usage based effectiveness measures: monitoring application performance in information retrieval. Proceedings of the 18th ACM conference on Information and knowledge management.
Evaluating whole-page relevance. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval.
Evaluating search systems using result page context. Proceedings of the third symposium on Information interaction in context.
On the potential search effectiveness of MeSH (medical subject headings) terms. Proceedings of the third symposium on Information interaction in context.
Biomedical information retrieval: the BioTracer approach. ITBAM '10: Proceedings of the First international conference on Information technology in bio- and medical informatics.
Supporting biomedical information retrieval: the bioTracer approach. Transactions on large-scale data- and knowledge-centered systems IV.
An extensible personal photograph collection for graded relevance assessments and user simulation. Proceedings of the 2nd ACM International Conference on Multimedia Retrieval.
Comparison of chemical similarity measures using different numbers of query structures. Journal of Information Science.
Traditional Cranfield test collections represent an abstraction of a retrieval task that Sparck Jones calls the "core competency" of retrieval: a task that is necessary, but not sufficient, for user retrieval tasks. The abstraction facilitates research by controlling for (some) sources of variability, thus increasing the power of experiments that compare system effectiveness while reducing their cost. However, even within the highly abstracted case of the Cranfield paradigm, meta-analysis demonstrates that the user/topic effect is greater than the system effect, so experiments must include a relatively large number of topics to distinguish systems' effectiveness. The evidence further suggests that even a slight change to the abstraction, adding just a bit more characterization of the user, would result in either a dramatic loss of statistical power or a dramatic increase in the cost of retrieval experiments. Defining a new, feasible abstraction for supporting adaptive IR research will require winnowing the list of all possible factors that can affect retrieval behavior down to a minimal set of essential factors.
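The point about topic-set size can be made concrete with a standard power calculation. The Python sketch below is illustrative only: the effect size (delta) and per-topic variability (sigma) are hypothetical placeholders, not figures taken from any of the cited studies, and the normal-approximation formula is a generic statistical tool rather than the method of any particular paper.

```python
# Illustrative sketch: approximate topic-set size needed for a paired
# comparison of two systems, using the normal-approximation power formula
#   n ~ ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
# where delta is the true mean per-topic score difference between systems
# and sigma is the standard deviation of per-topic differences.
from scipy.stats import norm

def topics_needed(delta: float, sigma: float,
                  alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate number of topics for a two-sided paired test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return (z * sigma / delta) ** 2

# Hypothetical values: the required topic count grows with the square of
# the ratio sigma/delta, so large topic-to-topic variability dominates.
print(round(topics_needed(delta=0.02, sigma=0.05)))   # ~49 topics
print(round(topics_needed(delta=0.02, sigma=0.10)))   # ~196 topics
```

Under these assumptions, doubling the per-topic variability quadruples the number of topics needed at the same power, which illustrates why a large topic effect, rather than the system effect, drives the size and cost of Cranfield-style experiments.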