Towards better measures: evaluation of estimated resource description quality for distributed IR
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
An open problem in Distributed Information Retrieval is how to represent large document repositories (known as resources) efficiently. To facilitate resource selection, estimated descriptions of each resource are required, especially in non-cooperative distributed environments [1]. Accurate and efficient resource description estimation is important because it affects resource selection and, as a consequence, retrieval quality. Query-Based Sampling (QBS) was proposed as a novel solution for resource estimation [2], with further techniques developed thereafter [3]. However, determining whether one QBS technique generates better resource descriptions than another remains an unresolved issue. The initial metrics tested and deployed for measuring resource description quality were the Collection Term Frequency ratio (CTF) and the Spearman Rank Correlation Coefficient (SRCC) [2]. The former provides an indication of the percentage of terms seen, whilst the latter measures the term ranking order; neither, however, considers term frequency, which is important for resource selection. We re-examine this problem and consider measuring the quality of a resource description in the context of resource selection, where an estimate of the probability of a term given the resource is typically required. We believe a natural measure for comparing the estimated resource against the actual resource is the Kullback-Leibler divergence (KL). KL addresses the concerns raised previously by not over-representing low-frequency terms and by also considering term order [2]. In this paper, we re-assess the two previous measures alongside KL. Our preliminary investigation revealed that the former metrics display contradictory results, whilst KL suggested that a different QBS technique than that prescribed in [2] would provide better estimates.
This is a significant result, because it remains unclear which technique will consistently provide better resource descriptions. The remainder of this paper details the three measures, presents the experimental analysis of our preliminary study, and outlines our points of concern along with further research directions.
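To make the three measures concrete, the sketch below computes CTF, a simplified SRCC, and KL between an actual and an estimated term-frequency table. The toy data, the restriction of SRCC to common terms, and the Laplace smoothing used to keep KL finite are all our illustrative assumptions, not the formulations evaluated in the paper.

```python
from math import log

def ctf_ratio(actual, estimate):
    """CTF ratio: fraction of term occurrences in the actual resource
    covered by terms present in the estimated description."""
    total = sum(actual.values())
    seen = sum(c for t, c in actual.items() if t in estimate)
    return seen / total

def spearman_rcc(actual, estimate):
    """Simplified SRCC over terms common to both descriptions, ranking
    terms by descending frequency (ties ignored for brevity)."""
    common = set(actual) & set(estimate)
    rank_a = {t: r for r, t in enumerate(sorted(common, key=lambda t: -actual[t]), 1)}
    rank_e = {t: r for r, t in enumerate(sorted(common, key=lambda t: -estimate[t]), 1)}
    n = len(common)
    d2 = sum((rank_a[t] - rank_e[t]) ** 2 for t in common)
    return 1 - 6 * d2 / (n * (n * n - 1))

def kl_divergence(actual, estimate, alpha=1.0):
    """KL(actual || estimate) over the actual vocabulary; the estimate is
    Laplace-smoothed (an assumption here) so unseen terms stay finite."""
    vocab = list(actual)
    a_total = sum(actual.values())
    e_total = sum(estimate.get(t, 0) for t in vocab) + alpha * len(vocab)
    kl = 0.0
    for t in vocab:
        p = actual[t] / a_total
        q = (estimate.get(t, 0) + alpha) / e_total
        kl += p * log(p / q)
    return kl

# Hypothetical term-frequency tables (term -> count), illustrative only.
actual = {"retrieval": 50, "index": 30, "query": 15, "rank": 5}
estimate = {"retrieval": 10, "index": 8, "query": 2}

print(ctf_ratio(actual, estimate))      # 0.95: 95 of 100 occurrences covered
print(spearman_rcc(actual, estimate))   # 1.0: identical ranking of common terms
print(kl_divergence(actual, estimate))  # small positive divergence
```

Note how the three measures can disagree: here CTF and SRCC look near-perfect, yet KL is still positive because the estimate's relative frequencies deviate from the actual distribution.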