Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques

Authors:
G. Monroe; J. French; A. Powell
Venue:
HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02) - Volume 3
Year:
2002

Citing 0
Cited 8

Modeling web data
Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries

When one sample is not enough: improving text database selection using shrinkage
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Towards better measures: evaluation of estimated resource description quality for distributed IR
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems

Classification-aware hidden-web text database selection
ACM Transactions on Information Systems (TOIS)

Mapping geographic coverage of the web
Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems

Federated Search
Foundations and Trends in Information Retrieval

How much of the web is archived?
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Federated search in the wild: the combined power of over a hundred search engines
Proceedings of the 21st ACM international conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Abstract

In the context of information retrieval, traditional collection selection algorithms have been widely studied. These algorithms rely on language models, representations of the contents of each text collection over which selection is to be performed, but such language models cannot always be easily acquired. Query-based sampling is a technique for discovering these language models by interacting with a collection and observing the results. Previous work has shown query-based sampling to be a viable way to discover the contents of text collections when that information cannot be obtained otherwise. However, the characteristics of language models of WWW collections created using query-based sampling have not yet been studied. This work evaluates two query-based sampling techniques for building language models of three World Wide Web collections. Experimental results support the effectiveness of query-based sampling for building language models of web collections. This work also proposes a metric that may make it possible to determine the point at which further sampling of a given web collection can cease. This metric is used, alongside metrics from previous work, to assess the fidelity of the resulting language models.
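The general idea of query-based sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `search` function, the whitespace tokenizer, the random choice of the next query term, and the fixed document budget used as a stopping rule are all assumptions made for illustration. The abstract does not specify the two sampling techniques evaluated or the proposed stopping metric, so neither is reproduced here.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real text processing."""
    return text.lower().split()


def search(collection, term, k):
    """Hypothetical search interface: ids of the first k documents containing term."""
    hits = [i for i, doc in enumerate(collection) if term in tokenize(doc)]
    return hits[:k]


def query_based_sample(collection, seed_term, max_docs,
                       docs_per_query=2, max_queries=50, rng=None):
    """Build a term-frequency language model by querying the collection.

    Stops after max_docs distinct documents or max_queries queries
    (both are assumed stopping rules, not the metric proposed in the paper).
    """
    rng = rng or random.Random(0)   # seeded for repeatable runs
    model = Counter()               # learned language model: term -> frequency
    seen = set()                    # ids of documents already sampled
    tried = set()                   # query terms already issued
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        tried.add(term)
        for doc_id in search(collection, term, docs_per_query):
            if doc_id not in seen and len(seen) < max_docs:
                seen.add(doc_id)
                model.update(tokenize(collection[doc_id]))
        # Pick the next query term from the model learned so far.
        candidates = sorted(t for t in model if t not in tried)
        if not candidates:
            break
        term = rng.choice(candidates)
    return model, seen


# Toy collection standing in for a searchable web collection.
toy_docs = [
    "web search engines index web pages",
    "language models describe collection contents",
    "query sampling probes a search engine",
    "collection selection uses language models",
    "sampling web collections with queries",
]
learned_model, sampled_ids = query_based_sample(toy_docs, "web", max_docs=3)
print(sorted(sampled_ids), learned_model.most_common(3))
```

The learned model only ever contains terms from documents the search interface returned, which is why the fidelity of such a model relative to the full collection is the question the paper evaluates.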