A study on performance volatility in information retrieval

  • Authors:
  • Mehdi Hosseini

  • Affiliations:
  • University College London, London, United Kingdom

  • Venue:
  • Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2009

Abstract

A common practice in the comparative evaluation of information retrieval (IR) systems is to create a test collection comprising a set of topics (queries), a document corpus, and relevance judgments, and to measure the performance of retrieval systems over such a collection. A typical evaluation of a system involves computing a performance metric, e.g., Average Precision (AP), for each topic and then using the average of that metric, e.g., Mean Average Precision (MAP), to express overall system performance. However, averages do not capture all the important aspects of system performance and, used alone, may not fully express system effectiveness: an average can mask large variance in individual topic effectiveness. Our hypothesis is that, in addition to average overall performance, attention needs to be paid to how a system's performance varies across topics. This variability can be measured by calculating the standard deviation (SD) of the individual per-topic scores. We refer to this performance variation as Volatility.
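As a rough illustration of the idea in the abstract, the sketch below (Python, with made-up per-topic AP values) computes MAP as the mean of per-topic AP and volatility as the standard deviation of those same scores. The system names and scores are hypothetical, and the choice of sample standard deviation is an assumption; the abstract does not specify how SD is computed. It is not the author's implementation, only a sketch of the measurement being described.

```python
import statistics

# Hypothetical per-topic Average Precision (AP) scores for two systems.
# Both systems have the same MAP, but "System B" is far more volatile.
ap_scores = {
    "System A": [0.42, 0.40, 0.38, 0.41, 0.39],
    "System B": [0.70, 0.10, 0.65, 0.05, 0.50],
}

for system, scores in ap_scores.items():
    map_score = statistics.mean(scores)    # Mean Average Precision (MAP)
    volatility = statistics.stdev(scores)  # SD of per-topic AP (sample SD assumed)
    print(f"{system}: MAP = {map_score:.3f}, volatility (SD) = {volatility:.3f}")
```

Running this prints identical MAP values (0.400) for both hypothetical systems while their volatility differs by an order of magnitude, which is the masking effect of averages that the abstract refers to.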