A case for automatic system evaluation

Authors:
Claudia Hauff;Djoerd Hiemstra;Leif Azzopardi;Franciska de Jong
Affiliations:
University of Twente, Enschede, The Netherlands;University of Twente, Enschede, The Netherlands;University of Glasgow, Glasgow, UK;University of Twente, Enschede, The Netherlands
Venue:
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Year:
2010

Citing 15
Cited 6

Viewing morphology as an inference process

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness

Information Processing and Management: an International Journal
Ranking retrieval systems without relevance judgments

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
On the effectiveness of evaluating retrieval systems in the absence of relevance judgments

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Methods for ranking information retrieval systems without relevance judgments

Proceedings of the 2003 ACM symposium on Applied computing
Scaling IR-system evaluation using term relevance sets

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Incremental test collections

Proceedings of the 14th ACM international conference on Information and knowledge management
Automatic ranking of information retrieval systems using data fusion

Information Processing and Management: an International Journal
A statistical method for system evaluation using incomplete judgments

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Using the structure of overlap between search results to rank retrieval systems without relevance judgments

Information Processing and Management: an International Journal
Hits hits TREC: exploring IR evaluation results with network analysis

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Performance prediction using spatial autocorrelation

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments

ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
A few good topics: Experiments in topic set reduction for retrieval evaluation

ACM Transactions on Information Systems (TOIS)
Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions

ECIR'07 Proceedings of the 29th European conference on IR research

Retrieval system evaluation: automatic evaluation versus incomplete judgments

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Aspects and analysis of patent test collections

PaIR '10 Proceedings of the 3rd international workshop on Patent information retrieval
Exploring ant colony optimisation for adaptive interactive search

ICTIR'11 Proceedings of the Third international conference on Advances in information retrieval theory
Recent developments in information retrieval

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
An uncertainty-aware query selection model for evaluation of IR systems

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
On Using Fewer Topics in Information Retrieval Evaluations

Proceedings of the 2013 Conference on the Theory of Information Retrieval

Abstract

Ranking a set of retrieval systems according to their retrieval effectiveness without relying on relevance judgments was first explored by Soboroff et al. [13]. Over the years, a number of alternative approaches have been proposed, all of which have been evaluated on early TREC test collections. In this work, we perform a wider analysis of system ranking estimation methods on sixteen TREC data sets, which cover more tasks and corpora than previous studies. Our analysis reveals that the performance of system ranking estimation approaches varies across topics. This observation motivates the hypothesis that the performance of such methods can be improved by selecting the “right” subset of topics from a topic set. We show that using topic subsets improves the performance of automatic system ranking methods by 26% on average, with a maximum of 60%. We also observe that the commonly experienced problem of underestimating the performance of the best systems is data set dependent and not inherent to system ranking estimation. These findings support the case for automatic system evaluation and motivate further research.