NL sampler: random sampling of web documents based on natural language with query hit estimation

  • Authors:
  • Daniel Schuster;Alexander Schill

  • Affiliations:
  • Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany

  • Venue:
  • Proceedings of the 2007 ACM symposium on Applied computing
  • Year:
  • 2007

Abstract

Random sampling of documents is an important supporting function for research in information science, content-related research (such as content adaptation), and the social sciences. While looking for an appropriate method to obtain a random sample of Microsoft Office files for research on presentation sharing applications, we found that the two main approaches, Random Walk and Random Search, are not suited to finding formatted documents. Both approaches are designed for large-scale Web analysis and do not fit more specialized requirements. In this paper, we adopt the Random Search approach first described by Bharat and Broder and extend it into a more universal random sampling method, NL Sampler, which is based on natural language lexica and can be used in a wide range of application domains. It supports parameters such as file type or DNS domain restrictions while preserving representativeness. We implemented and evaluated the approach and found a Zipf-like distribution of average hits per query. This distribution enables the estimation of query hits for a given set of parameters, so the method can be used in many more application areas than the approaches previously published. Estimation functions are given for Microsoft Word and PowerPoint documents.
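
The two ideas in the abstract, lexicon-based random queries with restrictions and hit estimation from a Zipf-like distribution, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `filetype:` and `site:` operator syntax, the use of a rank variable as the predictor, and the least-squares fit in log-log space are all assumptions made for illustration; the paper's actual estimation functions are not reproduced here.

```python
import math
import random


def build_query(lexicon, n_terms=2, filetype=None, domain=None, rng=random):
    """Draw n_terms distinct lexicon words at random and append restrictions.

    Assumption: the target search engine accepts `filetype:` and `site:`
    operators; the paper may express these restrictions differently.
    """
    terms = rng.sample(lexicon, n_terms)
    if filetype:
        terms.append(f"filetype:{filetype}")
    if domain:
        terms.append(f"site:{domain}")
    return " ".join(terms)


def fit_power_law(ranks, avg_hits):
    """Fit avg_hits ~ c * rank**(-s) by least squares on log-log values.

    Assumption: a plain power law stands in for the "Zipf-like" shape.
    Returns the pair (c, s).
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(h) for h in avg_hits]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    c = math.exp(mean_y - slope * mean_x)
    return c, -slope


def estimate_hits(rank, c, s):
    """Predict the average number of hits for a given rank."""
    return c * rank ** (-s)


if __name__ == "__main__":
    # Hypothetical lexicon and synthetic hit counts, for demonstration only.
    lexicon = ["house", "river", "music", "garden", "window"]
    print(build_query(lexicon, n_terms=2, filetype="ppt", domain="de"))

    ranks = [1, 2, 3, 4, 5]
    synthetic_hits = [1000 * r ** -1.2 for r in ranks]
    c, s = fit_power_law(ranks, synthetic_hits)
    print(estimate_hits(10, c, s))
```

In such a setup, the fitted parameters would be estimated separately for each parameter set (for example, one fit per file type), which mirrors the abstract's statement that separate estimation functions are given for Word and PowerPoint documents.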