The SHARC framework for data quality in Web archiving
The VLDB Journal — The International Journal on Very Large Data Bases
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and the completeness requirement mandate very slow, long-duration crawling while the Web sites themselves undergo changes. This paper presents the SHARC framework for assessing the data quality of Web archives and for tuning capturing strategies toward better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. We assume that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case, where all change rates are known, and generalize the approach into an online algorithm that gathers change-rate information about a Web site while it is being crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. All strategies are fully implemented in a testbed and shown to be effective through experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.
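The offline scheduling idea lends itself to a small illustration. The sketch below is a toy reconstruction, not the paper's actual algorithm: it assumes each page changes according to an independent Poisson process with a known rate, that the crawler downloads one page per unit-time slot, and that the crawl midpoint serves as the capture's reference time. Under those assumptions, placing frequently changing pages closest to the reference time minimizes the expected number of changes falling between a page's download and the reference time (one plausible sharpness measure); the function names and the measure itself are illustrative, not taken from the paper.

```python
import random

def expected_changes(rates, slot_of_page, ref_time):
    """Expected number of Poisson changes between each page's download
    slot and the reference time, summed over all pages."""
    return sum(rate * abs(slot_of_page[p] - ref_time)
               for p, rate in enumerate(rates))

def schedule_downloads(change_rates):
    """Toy offline schedule: one page per unit-time slot.  Hot pages
    are placed closest to the crawl midpoint (the reference time),
    cold pages toward the edges of the crawl interval."""
    n = len(change_rates)
    ref_time = (n - 1) / 2.0
    # Time slots ordered by distance from the reference time, nearest first.
    slots = sorted(range(n), key=lambda t: abs(t - ref_time))
    # Pages ordered by change rate, hottest first.
    pages = sorted(range(n), key=lambda p: -change_rates[p])
    order = [None] * n  # order[t] = page downloaded in slot t
    for page, slot in zip(pages, slots):
        order[slot] = page
    return order, ref_time

if __name__ == "__main__":
    rates = [random.uniform(0.01, 2.0) for _ in range(9)]  # changes per time unit
    order, ref = schedule_downloads(rates)
    slot_of_page = {page: slot for slot, page in enumerate(order)}
    print("download order:", order)
    print("expected missed changes:", expected_changes(rates, slot_of_page, ref))
```

Pairing the hottest pages with the slots nearest the reference time is optimal for this toy objective: the objective is a sum of products of change rates and slot distances, and a rearrangement argument shows such a sum is minimized by matching the largest rates with the smallest distances.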