Compressed collections for simulated crawling

  • Authors: Alessio Orlandi; Sebastiano Vigna
  • Affiliations: Università di Pisa, Italy; Università degli Studi di Milano, Italy
  • Venue: ACM SIGIR Forum
  • Year: 2008

Abstract

Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed yet quickly accessible form. Our main application is reproducing the behaviour of focused crawlers: by coupling our collection with the corresponding web graph, compressed with WebGraph [3], we make it possible to apply text-based machine learning tools to the collection while keeping the data set footprint small. We describe a collection based on a crawl of 100 million pages of the .uk domain, publicly available together with an open-source Java implementation of our techniques.
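The abstract does not say which encoding the authors use for term counts. As an illustration only, the following minimal sketch assumes Elias gamma codes, a standard instantaneous code that spends few bits on small positive integers, which is the typical shape of within-document frequencies. The class name GammaCounts and its methods are hypothetical and are not taken from the authors' implementation.

```java
import java.util.BitSet;

/** Illustrative sketch: Elias gamma coding of positive term counts (not the paper's actual format). */
public class GammaCounts {

    /** Writes the gamma code of x (x >= 1) at bit position pos; returns the next free position. */
    static int writeGamma(BitSet bits, int pos, int x) {
        if (x < 1) throw new IllegalArgumentException("gamma codes require x >= 1");
        int msb = 31 - Integer.numberOfLeadingZeros(x); // index of the highest set bit of x
        pos += msb; // unary prefix: msb zero bits (BitSet bits are 0 by default)
        for (int i = msb; i >= 0; i--) { // then x in binary, most significant bit first
            if (((x >>> i) & 1) != 0) bits.set(pos);
            pos++;
        }
        return pos;
    }

    /** Encodes all counts one after another into bits; returns the total number of bits used. */
    static int encode(int[] counts, BitSet bits) {
        int pos = 0;
        for (int c : counts) pos = writeGamma(bits, pos, c);
        return pos;
    }

    /** Decodes n gamma-coded values starting at bit 0. */
    static int[] decode(BitSet bits, int n) {
        int[] out = new int[n];
        int pos = 0;
        for (int k = 0; k < n; k++) {
            int msb = 0;
            while (!bits.get(pos)) { msb++; pos++; } // count the leading zeros of the prefix
            int x = 0;
            for (int i = 0; i <= msb; i++) { // read msb + 1 binary digits
                x = (x << 1) | (bits.get(pos) ? 1 : 0);
                pos++;
            }
            out[k] = x;
        }
        return out;
    }
}
```

Calling encode on the counts of one document and then decode with the same number of terms returns the original array. A count of 1, the most common case, costs a single bit under this code.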