Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques

  • Authors:
  • G. Monroe;J. French;A. Powell

  • Venue:
  • HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02) - Volume 3
  • Year:
  • 2002

Abstract

In the context of information retrieval, traditional collection selection algorithms have been widely studied. These algorithms utilize language models, a representation of the contents of each text collection over which selection is to be performed, but these language models cannot always be easily acquired. Query-based sampling is a technique by which these language models are discovered by interacting with a collection and observing the results. Previous work has shown query-based sampling to be a viable solution to the problem of discovering the contents of text collections when the information cannot be otherwise obtained. However, the characteristics of language models of WWW collections created using query-based sampling have not yet been studied. This work evaluates two query-based sampling techniques for building language models of three World Wide Web collections. Experimental results support the effectiveness of query-based sampling as a solution for building language models of web collections. This work also proposes a metric by which it may be possible to determine the point at which further sampling of a given web collection can cease. This metric is used along with other metrics used in previous work to determine the fidelity of these language models.
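
The general idea of query-based sampling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the in-memory `search` function, the whitespace tokenizer, the random choice of the next query term, and the fixed document budget are all assumptions made for illustration. The abstract does not specify the two sampling techniques or the proposed stopping metric, so neither is reproduced here.

```python
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopwords.
    return text.lower().split()


def search(collection, term, top_n):
    # Hypothetical search interface standing in for a real collection's
    # query endpoint: return up to top_n (doc_id, text) pairs containing term.
    hits = [(i, doc) for i, doc in enumerate(collection) if term in tokenize(doc)]
    return hits[:top_n]


def query_based_sample(collection, seed_term, max_docs, top_n=2,
                       max_queries=100, rng=None):
    """Build a term-frequency language model from sampled documents only."""
    rng = rng or random.Random(0)
    model = Counter()   # learned language model: term -> frequency in the sample
    seen = set()        # ids of documents already sampled
    used = set()        # query terms already issued
    query = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        used.add(query)
        for doc_id, text in search(collection, query, top_n):
            if doc_id not in seen and len(seen) < max_docs:
                seen.add(doc_id)
                model.update(tokenize(text))
        # Assumption: the next query is a random unused term from the
        # language model learned so far.
        candidates = sorted(set(model) - used)
        if not candidates:
            break
        query = rng.choice(candidates)
    return model, seen


docs = [
    "web search engines index pages",
    "language models describe collections",
    "query sampling probes web collections",
    "pages link to other pages",
]
model, seen = query_based_sample(docs, "web", max_docs=3)
```

The sampler only ever sees documents returned by queries, so the resulting `model` approximates the collection's vocabulary without direct access to its index. In this sketch a simple document budget (`max_docs`) decides when to stop sampling.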