NL sampler: random sampling of web documents based on natural language with query hit estimation

Authors:
Daniel Schuster;Alexander Schill
Affiliations:
Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany
Venue:
Proceedings of the 2007 ACM symposium on Applied computing
Year:
2007

References:
A technique for measuring the relative size and overlap of public Web search engines. WWW7 Proceedings of the seventh international conference on World Wide Web 7.
Measuring index quality using random walks on the Web. WWW '99 Proceedings of the eighth international conference on World Wide Web.
Accessibility of information on the Web. intelligence.
On near-uniform URL sampling. Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking.
Random sampling from a search engine's index. Proceedings of the 15th international conference on World Wide Web.
Opportunities for bandwidth adaptation in Microsoft Office documents. WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4.
Traffic analysis of a Web proxy caching hierarchy. IEEE Network: The Magazine of Global Internetworking.

Abstract

Random sampling of documents is a substantial supporting function for research in information science, content-related research (such as content adaptation), and the social sciences. While looking for an appropriate method to obtain a random sample of Microsoft Office files for research on presentation sharing applications, we found that the two main approaches, Random Walk and Random Search, are not suitable for finding formatted documents. Both approaches are designed for large-scale Web analysis and do not fit more specialized requirements. In this paper, we adopt and extend the Random Search approach first described by Bharat and Broder into a more universal random sampling method based on natural language lexica, called NL Sampler, which can be used in a wide range of application domains. It supports parameters such as file type or DNS domain restrictions while preserving representativeness. We implemented and evaluated the approach and found a Zipf-like distribution of average hits per query, which enables estimation of query hits for a given set of parameters; the method can therefore be used in considerably more application areas than previously published approaches. Estimation functions are given for Microsoft Word and PowerPoint documents.
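
To make the two ideas in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a query is a single word drawn uniformly from a lexicon, that file type and domain restrictions are expressed with `filetype:` and `site:` search operators, and that a "Zipf-like" distribution can be approximated by a power law hits(rank) = c * rank^(-s) fitted by least squares in log-log space. The function names (`build_query`, `fit_zipf`, `estimate_hits`) and the synthetic hit counts are hypothetical; the estimation functions actually published for Word and PowerPoint documents are not reproduced here.

```python
import math
import random


def build_query(lexicon, rng, filetype=None, domain=None):
    """Draw one lexicon word uniformly at random and append optional restrictions.

    The `filetype:` and `site:` operator syntax is an assumption about the
    search engine being queried.
    """
    parts = [rng.choice(lexicon)]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if domain:
        parts.append(f"site:{domain}")
    return " ".join(parts)


def fit_zipf(hits_by_rank):
    """Fit hits(rank) = c * rank**(-s) by linear regression on log-log values.

    `hits_by_rank[i]` is the average hit count at rank i + 1; all values
    must be positive because logarithms are taken. Returns (c, s).
    """
    xs = [math.log(rank) for rank in range(1, len(hits_by_rank) + 1)]
    ys = [math.log(hits) for hits in hits_by_rank]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    c = math.exp(mean_y - slope * mean_x)
    return c, -slope


def estimate_hits(rank, c, s):
    """Estimate the hit count at a given rank from the fitted parameters."""
    return c * rank ** (-s)


if __name__ == "__main__":
    rng = random.Random(42)
    lexicon = ["network", "document", "sample", "research"]  # toy lexicon
    print(build_query(lexicon, rng, filetype="ppt", domain="de"))

    # Synthetic, made-up hit counts that follow an exact power law.
    synthetic = [1000.0 * rank ** -1.0 for rank in range(1, 51)]
    c, s = fit_zipf(synthetic)
    print(c, s, estimate_hits(100, c, s))
```

With real data, the hit counts would come from issuing the generated queries against a search engine; the fitted parameters could then be used to estimate how many hits a query is likely to return under a given restriction before it is sent.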