A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Accessibility of information on the Web
intelligence
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Random sampling from a search engine's index
Proceedings of the 15th international conference on World Wide Web
Opportunities for bandwidth adaptation in microsoft office documents
WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
Traffic analysis of a Web proxy caching hierarchy
IEEE Network: The Magazine of Global Internetworking
Random sampling of documents is an important supporting function for research in information science, content-related research (such as content adaptation), and the social sciences. While looking for an appropriate method to obtain a random sample of Microsoft Office files for research on presentation sharing applications, we found that the two main approaches, Random Walk and Random Search, are not suited to finding formatted documents. Both approaches are designed for large-scale Web analysis and do not fit more specialized requirements. In this paper, we adapt and extend the Random Search approach first described by Bharat and Broder into a more universal random sampling method based on natural language lexica, called NL Sampler, that can be used in a wide range of application domains. It supports parameters such as file type or DNS domain restrictions while preserving representativeness. We implemented and evaluated the approach and found a Zipf-like distribution of average hits per query, which enables estimation of query hits for a given set of parameters; the method can therefore be used in many more application areas than previously published approaches. Estimation functions are given for Microsoft Word and PowerPoint documents.
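As an illustration only, the following minimal Python sketch shows how lexicon-based queries with file type and domain restrictions might be assembled, and how a Zipf-like curve could be fitted to average hit counts. It is not the NL Sampler implementation and does not reproduce the estimation functions of the paper. The function names, the `filetype:` and `site:` operator syntax, and the log-log least-squares fit are all assumptions made for this example.

```python
import math
import random


def build_query(term, filetype=None, site=None):
    """Combine one lexicon term with optional restrictions.

    The 'filetype:' and 'site:' operators are assumed here; a given
    search engine may use a different syntax.
    """
    parts = [term]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


def sample_queries(lexicon, k, filetype=None, site=None, seed=None):
    """Draw k distinct terms uniformly at random and turn them into queries."""
    rng = random.Random(seed)
    terms = rng.sample(lexicon, k)
    return [build_query(t, filetype, site) for t in terms]


def fit_zipf(avg_hits):
    """Fit hits(rank) ~ c * rank**(-s) by least squares in log-log space.

    Hypothetical helper: needs at least two positive values.
    Returns the pair (c, s).
    """
    ranked = sorted((h for h in avg_hits if h > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(h) for h in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    c = math.exp(mean_y - slope * mean_x)
    return c, -slope
```

For example, `sample_queries(lexicon, 100, filetype="ppt", site="example.org")` would produce 100 restricted queries from a word list, and `fit_zipf` applied to the observed average hit counts would give the two parameters of a fitted curve that could then be used to estimate hits for an unseen rank.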