SHARC: framework for quality-conscious web archiving

  • Authors:
  • Dimitar Denev; Arturas Mazeika; Marc Spaniol; Gerhard Weikum

  • Affiliations:
  • Max Planck Institute for Informatics, Saarbrücken, Germany (all authors)

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2009


Abstract

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and the completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing data quality in Web archives and for tuning capturing strategies towards better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. It is assumed that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case where all change rates are known. We generalize the approach into an online algorithm that detects information about a Web site while it is being crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.
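To make the change-rate idea concrete, the following is a minimal illustrative sketch, not the paper's actual SHARC algorithm: it assumes each page's changes follow a Poisson process with a known rate (as in the offline setting the abstract describes) and uses the resulting change probability to order downloads, crawling the slowest-changing pages first so the fastest-changing ones are captured as late, and hence as fresh, as possible. The page names and rates are hypothetical.

```python
import math

def change_prob(rate: float, interval: float) -> float:
    # Under a Poisson change model with intensity `rate`, the probability
    # that a page changes at least once during `interval` time units.
    return 1.0 - math.exp(-rate * interval)

def schedule(pages, crawl_duration: float):
    # Illustrative heuristic: sort pages by ascending change probability,
    # so volatile pages are downloaded near the end of the crawl and are
    # therefore freshest when the capture completes.
    return [name for name, rate in
            sorted(pages, key=lambda p: change_prob(p[1], crawl_duration))]

# Hypothetical site: (URL, estimated changes per day).
pages = [("index.html", 2.0), ("about.html", 0.1), ("news.html", 5.0)]
print(schedule(pages, 1.0))  # → ['about.html', 'index.html', 'news.html']
```

This only conveys the flavor of rate-based scheduling; the paper's stochastically optimal strategy and its online generalization are defined by the quality measures developed in the text.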