A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval

Authors:
Mark Baillie;Mark J. Carman;Fabio Crestani
Affiliations:
CIS Dept., University of Strathclyde, Glasgow, UK;Faculty of Informatics, University of Lugano, Lugano, Switzerland;Faculty of Informatics, University of Lugano, Lugano, Switzerland
Venue:
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Year:
2009

Abstract

The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a finer-grained measure based on probabilistic topic models of text. We assume that a representative sample should capture the broad themes of the underlying text collection. If these themes are not captured, then resource selection will be affected in terms of performance, coverage and reliability. For example, resource selection algorithms that extrapolate from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be affected by samples that do not reflect the topical density of a collection. To address this issue, we propose to measure the relative entropy between the topic distribution obtained from a sample and that of the complete collection. Topics are both modelled from the collection and inferred in the sample using latent Dirichlet allocation. The paper presents an analysis and evaluation of this methodology across a number of collections and sampling algorithms.
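
The relative entropy comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that topic proportions for the collection and for the sample have already been estimated (for example with latent Dirichlet allocation) and are given as plain lists of probabilities. The function name `topic_kl_divergence`, the smoothing constant `epsilon`, the direction of the divergence (collection relative to sample), and all numeric values are assumptions made for illustration only.

```python
import math


def topic_kl_divergence(collection_topics, sample_topics, epsilon=1e-12):
    """Relative entropy D_KL(collection || sample), in bits.

    Both arguments are topic proportions over the same set of topics.
    The direction of the divergence is an assumption of this sketch.
    """
    if len(collection_topics) != len(sample_topics):
        raise ValueError("topic vectors must have the same length")

    # Smooth and renormalise so that a topic missing from the sample
    # gives a large but finite penalty instead of a division by zero.
    p = [x + epsilon for x in collection_topics]
    q = [x + epsilon for x in sample_topics]
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]

    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


# Hypothetical topic proportions over four topics (illustrative values only).
collection = [0.40, 0.30, 0.20, 0.10]
good_sample = [0.38, 0.32, 0.19, 0.11]    # close to the collection's themes
skewed_sample = [0.85, 0.10, 0.05, 0.00]  # misses one theme entirely

print(topic_kl_divergence(collection, good_sample))
print(topic_kl_divergence(collection, skewed_sample))
```

Under these assumptions, a sample whose topic proportions track the collection yields a divergence near zero, while a sample that over-represents one theme or misses another yields a noticeably larger value.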