Compressed collections for simulated crawling

Authors:
Alessio Orlandi; Sebastiano Vigna
Affiliations:
Università di Pisa, Italy; Università degli Studi di Milano, Italy
Venue:
ACM SIGIR Forum
Year:
2008

References:
[1] Focused crawling: a new approach to topic-specific Web resource discovery. WWW '99 Proceedings of the eighth international conference on World Wide Web.
[2] Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM).
[3] The webgraph framework I: compression techniques. Proceedings of the 13th international conference on World Wide Web.
[4] UbiCrawler: a scalable fully distributed web crawler. Software: Practice and Experience.
[5] Type less, find more: fast autocompletion search with a succinct index. SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
[6] A reference collection for web spam. ACM SIGIR Forum.
[7] Broadword implementation of rank/select queries. WEA'08 Proceedings of the 7th international conference on Experimental algorithms.

Abstract

Collections are a fundamental tool for the reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (also known as within-document frequencies) of a web snapshot in a form that is highly compressed yet quickly accessible. Our main application is reproducing the behaviour of focused crawlers: by coupling our collection with the corresponding web graph, compressed with WebGraph [3], we make it possible to apply text-based machine learning tools to the collection while keeping the data set footprint small. We describe a collection based on a crawl of 100 million pages of the .uk domain, which is publicly available together with an open-source Java implementation of our techniques.
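
As a rough illustration of what "compressed yet quickly accessible" term counts can look like, the following minimal Python sketch stores each document as gap-encoded term identifiers paired with counts, all written in Elias gamma code, and keeps one bit offset per document for random access. This is a generic textbook technique chosen for illustration only; it is not the method of the paper, and every name in it (`BitWriter`, `read_gamma`, `CompressedCounts`) is hypothetical.

```python
from typing import Dict, List, Tuple


class BitWriter:
    """Collects bits in a list. A real implementation would pack them into bytes."""

    def __init__(self) -> None:
        self.bits: List[int] = []

    def write_gamma(self, n: int) -> None:
        """Append the Elias gamma code of n (defined for n >= 1)."""
        if n < 1:
            raise ValueError("gamma codes represent integers >= 1")
        length = n.bit_length()
        # Unary prefix: (length - 1) zeros announce how many bits follow.
        self.bits.extend([0] * (length - 1))
        # Binary part, most significant bit first (always starts with 1).
        self.bits.extend((n >> i) & 1 for i in range(length - 1, -1, -1))


def read_gamma(bits: List[int], pos: int) -> Tuple[int, int]:
    """Decode one gamma code starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos] == 0:
        zeros += 1
        pos += 1
    value = 0
    for _ in range(zeros + 1):
        value = (value << 1) | bits[pos]
        pos += 1
    return value, pos


class CompressedCounts:
    """Term counts per document, gamma coded, with one bit offset per document."""

    def __init__(self, docs: List[Dict[int, int]]) -> None:
        writer = BitWriter()
        self.offsets: List[int] = []
        for doc in docs:
            self.offsets.append(len(writer.bits))
            # Number of distinct terms, shifted by one so that 0 is representable.
            writer.write_gamma(len(doc) + 1)
            prev = -1
            for term in sorted(doc):
                writer.write_gamma(term - prev)  # gap between term ids, >= 1
                writer.write_gamma(doc[term])    # within-document count, >= 1
                prev = term
        self.bits = writer.bits

    def document(self, i: int) -> Dict[int, int]:
        """Decode the term -> count map of document i without touching the others."""
        pos = self.offsets[i]
        n, pos = read_gamma(self.bits, pos)
        counts: Dict[int, int] = {}
        term = -1
        for _ in range(n - 1):
            gap, pos = read_gamma(self.bits, pos)
            count, pos = read_gamma(self.bits, pos)
            term += gap
            counts[term] = count
        return counts

    def length(self, i: int) -> int:
        """Document length, taken here as the sum of its term counts."""
        return sum(self.document(i).values())
```

In this sketch, calling `CompressedCounts(docs).document(i)` decodes only the bits of document `i`, which is the property that makes such a representation usable when a simulated crawler visits pages in an arbitrary order. The offset array is left uncompressed for simplicity; a practical system would compress it as well.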