The goal of system evaluation in information retrieval has always been to determine which of a set of systems is superior on a given collection. The tool used to determine system ordering is an evaluation metric such as average precision, which computes relative, collection-specific scores. We argue that a broader goal is achievable. In this paper we demonstrate that, by use of standardization, scores can be substantially independent of a particular collection, allowing systems to be compared even when they have been tested on different collections. Compared to current methods, our techniques provide richer information about system performance, improved clarity in outcome reporting, and greater simplicity in reviewing results from disparate sources.
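To make the idea of standardization concrete, the sketch below shows one common reading of it: each system's per-topic score is z-scored against the scores that a set of reference runs achieved on the same topic, then mapped to [0, 1] through the standard normal CDF. The function name `standardize_scores`, the choice of z-scoring, and the CDF mapping are illustrative assumptions, not a definitive statement of the method described in the paper.

```python
import numpy as np
from scipy.stats import norm

def standardize_scores(run_scores, reference_scores):
    """Per-topic score standardization (illustrative sketch).

    run_scores:       per-topic metric scores (e.g. AP) for the system under
                      test, shape (n_topics,)
    reference_scores: per-topic scores for a set of reference runs on the
                      same collection, shape (n_runs, n_topics)
    Returns standardized per-topic scores in [0, 1].
    """
    topic_mean = reference_scores.mean(axis=0)         # per-topic mean of the reference runs
    topic_std = reference_scores.std(axis=0, ddof=1)   # per-topic spread of the reference runs
    z = (run_scores - topic_mean) / topic_std          # how far above/below the norm the run sits
    return norm.cdf(z)                                 # map z-scores to [0, 1] via the normal CDF

# Example: one run scored on 3 topics, standardized against 4 reference runs.
ref = np.array([[0.20, 0.55, 0.10],
                [0.25, 0.60, 0.05],
                [0.30, 0.50, 0.15],
                [0.35, 0.65, 0.20]])
run = np.array([0.40, 0.45, 0.12])
print(standardize_scores(run, ref))
```

Under this reading, a standardized score near 0.5 means the system performed about as well as the reference runs on that topic, regardless of whether the topic itself was easy or hard, which is what allows scores to be compared across collections.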