References
[1] Focused crawling: a new approach to topic-specific Web resource discovery. In WWW '99: Proceedings of the Eighth International Conference on World Wide Web.
[2] Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM).
[3] The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web.
[4] UbiCrawler: a scalable fully distributed web crawler. Software: Practice & Experience.
[5] Type less, find more: fast autocompletion search with a succinct index. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[6] A reference collection for web spam. ACM SIGIR Forum.
[7] Broadword implementation of rank/select queries. In WEA '08: Proceedings of the 7th International Conference on Experimental Algorithms.
Abstract
Collections are a fundamental tool for the reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (also known as within-document frequencies) of a web snapshot in a highly compressed and nonetheless quickly accessible form. Our main application is reproducing the behaviour of focused crawlers: by coupling our collection with the corresponding web graph compressed with WebGraph [3], we make it possible to apply text-based machine-learning tools to the collection while keeping the data-set footprint small. We describe a collection based on a crawl of 100 million pages of the .uk domain, which is publicly available together with an open-source Java implementation of our techniques.
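To make the "compressed yet quickly accessible" idea concrete, the sketch below shows one simple way a sequence of term counts can be stored in compressed form and still be decoded: Elias gamma codes packed into a bit array, written in Java since the collection ships with a Java implementation. This is an illustrative toy under our own assumptions, not the collection's actual on-disk format: the class name GammaCodedCounts and the purely sequential decoder are hypothetical, and the references suggest the real techniques involve more refined instantaneous codes [2] and broadword rank/select structures [7] to support fast random access.

import java.util.Arrays;

/**
 * Minimal illustrative sketch (hypothetical, not the collection's actual
 * format): within-document term counts stored as Elias gamma codes packed
 * into a long array. Compressed, yet fully decodable without external data.
 */
public final class GammaCodedCounts {
    private long[] bits = new long[1];
    private long length; // number of bits written so far

    private void writeBit(int b) {
        int word = (int) (length >>> 6);
        if (word == bits.length) bits = Arrays.copyOf(bits, bits.length * 2);
        if (b != 0) bits[word] |= 1L << (length & 63);
        length++;
    }

    /** Appends the gamma code of x >= 1: floor(log2 x) zeroes, then x in binary. */
    public void write(int x) {
        int msb = 31 - Integer.numberOfLeadingZeros(x);
        for (int i = 0; i < msb; i++) writeBit(0);               // unary length part
        for (int i = msb; i >= 0; i--) writeBit((x >>> i) & 1);  // binary part
    }

    /** Decodes the first n codes sequentially; a real implementation would
        add auxiliary structures for random access instead. */
    public int[] decodeAll(int n) {
        int[] result = new int[n];
        long pos = 0;
        for (int k = 0; k < n; k++) {
            int zeroes = 0;
            while (readBit(pos++) == 0) zeroes++; // count leading zeroes, consume the 1
            int x = 1;                            // the consumed 1 is the leading bit
            for (int i = 0; i < zeroes; i++) x = (x << 1) | readBit(pos++);
            result[k] = x;
        }
        return result;
    }

    private int readBit(long pos) {
        return (int) (bits[(int) (pos >>> 6)] >>> (pos & 63)) & 1;
    }

    public static void main(String[] args) {
        GammaCodedCounts c = new GammaCodedCounts();
        int[] counts = { 1, 3, 2, 7, 1, 42 };    // hypothetical term counts
        for (int x : counts) c.write(x);
        System.out.println(Arrays.toString(c.decodeAll(counts.length)));
        // prints [1, 3, 2, 7, 1, 42]
    }
}

Gamma codes favour the small values that dominate term-count distributions (a count of 1 takes a single bit), which is what keeps the overall footprint small while leaving every value recoverable from the bit stream alone.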