As evaluation is an important but difficult part of information retrieval system design and experimentation, evaluation questions have been the subject of much research. An "evaluation study" is an investigation into some aspect of evaluation itself. Such studies typically experiment on ranked results from actual retrieval systems, most often runs submitted to TREC tracks. We argue that the standard of evidence in evaluation studies should be raised to the level required of text retrieval studies: testing on multiple data sets and on multiple subsets of each data set, and comparing against baselines using hypothesis testing. We demonstrate that baseline performance on the standard data sets is already quite high, so strong evidence is needed to support claims of improvement.
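As an illustration of the kind of baseline comparison the abstract calls for, the sketch below runs a paired significance test of a hypothetical method against a baseline over per-topic effectiveness scores on several collections. This is a minimal sketch, not the paper's actual procedure: the collection names, the randomly generated score values, and the choice of a paired t-test are all assumptions made for illustration.

```python
# Minimal sketch: hypothesis testing of a method against a baseline
# across multiple test collections, as the abstract recommends.
# All data below is synthetic; the collection names are placeholders.
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-topic scores (e.g., average precision, or Kendall's tau
# with an official system ranking) for a proposed method and a baseline.
collections = {
    "TREC-7 adhoc": (rng.uniform(0.5, 0.9, 50), rng.uniform(0.5, 0.9, 50)),
    "TREC-8 adhoc": (rng.uniform(0.5, 0.9, 50), rng.uniform(0.5, 0.9, 50)),
}

for name, (method_scores, baseline_scores) in collections.items():
    # Paired t-test: the same topics underlie both score vectors,
    # so per-topic differences are the natural unit of analysis.
    t_stat, p_value = stats.ttest_rel(method_scores, baseline_scores)
    print(f"{name}: mean diff = {np.mean(method_scores - baseline_scores):+.3f}, "
          f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A Wilcoxon signed-rank test (scipy.stats.wilcoxon) is a common nonparametric alternative when the per-topic score differences cannot be assumed to be approximately normal.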