How reliable are the results of large-scale information retrieval experiments? Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98).
Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS).
Efficient similarity search and classification via rank aggregation. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data.
Topic prediction based on comparative retrieval rankings. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04).
Information retrieval system evaluation: effort, sensitivity, and reliability. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05).
On GMAP: and other transformations. Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06).
Modeling user behavior using a search engine. Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI '07).
Hits hits TREC: exploring IR evaluation results with network analysis. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07).
Validity and power of t-test for comparing MAP and GMAP. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07).
How robust are multilingual information retrieval systems? Proceedings of the 2008 ACM Symposium on Applied Computing (SAC '08).
Score standardization for inter-collection comparison of retrieval systems. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).
The good and the bad system: does the test collection predict users' effectiveness? Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).
A new rank correlation coefficient for information retrieval. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).
A new interpretation of average precision. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).
Precision-at-ten considered redundant. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).
Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems (TOIS).
ECIR '08: Advances in Information Retrieval, Proceedings of the 30th European Conference on IR Research.
CEC '09: Proceedings of the Eleventh IEEE Congress on Evolutionary Computation.
From web data to entities and back. Proceedings of the 22nd International Conference on Advanced Information Systems Engineering (CAiSE '10).
Comparative evaluations of information retrieval systems rest on a number of key premises: that representative topic sets can be created, that suitable relevance judgements can be generated, and that systems can be sensibly compared based on their aggregate performance over the selected topic set. This paper considers the third of these assumptions: that the performance of a system on a set of topics can be represented by a single overall score, such as the average or some other central statistic. In particular, we experiment with score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. Using past TREC runs, we show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization (Webber et al., SIGIR 2008) achieves the same outcome in a more consistent manner.
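As a concrete illustration of the aggregation techniques the abstract names, here is a minimal sketch in Python. The function names, the epsilon value, and the example scores are illustrative choices, not taken from the paper; the standardization routine follows the spirit of Webber et al. (per-topic z-scores across systems) but omits their final mapping through the cumulative normal distribution.

```python
import math
import statistics

EPS = 1e-5  # GMAP-style adjustment constant; the exact value is a convention, not fixed by the paper


def arithmetic_mean(scores):
    """Plain average of per-topic scores (the usual MAP-style aggregate)."""
    return sum(scores) / len(scores)


def adjusted_geometric_mean(scores, eps=EPS):
    """Geometric mean with an epsilon added to each score so that topics
    scoring (near) zero do not force the aggregate toward zero."""
    return math.exp(sum(math.log(s + eps) for s in scores) / len(scores)) - eps


def harmonic_mean(scores, eps=EPS):
    """Harmonic mean, also epsilon-adjusted to avoid division by zero."""
    return len(scores) / sum(1.0 / (s + eps) for s in scores)


def median_score(scores):
    """Median per-topic score, insensitive to outlying topics."""
    return statistics.median(scores)


def standardize(runs):
    """Per-topic z-score standardization across systems (a sketch of the
    idea in Webber et al., SIGIR 2008; their cumulative-normal rescaling
    to [0, 1] is omitted here). `runs` maps system name -> per-topic
    scores, with all score lists aligned on the same topic order."""
    systems = list(runs)
    n_topics = len(next(iter(runs.values())))
    z = {s: [] for s in systems}
    for t in range(n_topics):
        column = [runs[s][t] for s in systems]
        mu = statistics.mean(column)
        sigma = statistics.pstdev(column) or 1.0  # guard: zero variance
        for s in systems:
            z[s].append((runs[s][t] - mu) / sigma)
    return z


if __name__ == "__main__":
    # Two hypothetical systems scored on four topics, one topic near zero.
    runs = {
        "sysA": [0.02, 0.45, 0.60, 0.30],
        "sysB": [0.20, 0.35, 0.40, 0.25],
    }
    for name, scores in runs.items():
        print(name,
              round(arithmetic_mean(scores), 4),
              round(adjusted_geometric_mean(scores), 4),
              round(harmonic_mean(scores), 4),
              round(median_score(scores), 4))
    z = standardize(runs)
    ranking = sorted(runs, key=lambda s: arithmetic_mean(z[s]), reverse=True)
    print("ranking by mean standardized score:", ranking)
```

Note that this sketch estimates the per-topic mean and deviation from only the systems being compared; in practice the standardization moments would be derived from a fixed reference set of runs, as the cited paper describes.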