SHARC: framework for quality-conscious web archiving
Proceedings of the VLDB Endowment
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts working on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but politeness etiquette and completeness requirements mandate very slow, long-duration crawling while the Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit-revisit crawls. Single-visit crawls download every page of a site exactly once, in an order that aims to minimize the "blur" in capturing the site. Visit-revisit strategies revisit pages after their initial downloads to check for intermediate changes; the revisiting order aims to maximize the "coherence" of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, "sharp" site. Coherence is a deterministic quality measure that counts the number of unchanged, and thus coherently captured, pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of, or predictions for, the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions.
All strategies are fully implemented in a testbed and shown to be effective through experiments with both synthetically generated sites and a series of periodic crawls of different Web sites.
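To make the two quality notions concrete, the sketch below illustrates them on toy data. It is not the paper's implementation: the data layout, function names, and the simplified Poisson-based blur score (expected changes between a page's download time and an access time) are all assumptions for this example; coherence follows the abstract's definition directly.

```python
# Illustrative sketch of the quality measures described in the abstract.
# Not the paper's implementation; data layout and names are assumed here.

def coherence(capture, changes):
    """Deterministic coherence: the number of pages with no change between
    their initial download (visit) and their revisit.

    capture: {page: (visit_time, revisit_time)}
    changes: {page: list of change timestamps}
    """
    coherent = 0
    for page, (visit, revisit) in capture.items():
        # a page is coherently captured if no change falls in (visit, revisit]
        if not any(visit < t <= revisit for t in changes.get(page, [])):
            coherent += 1
    return coherent


def blur(capture, rates, t_access):
    """Blur-like score (a simplification of the paper's stochastic measure):
    under a Poisson change model with per-page rate lambda_p, the expected
    number of changes between a page's download and a time-travel access
    at t_access is lambda_p * |t_access - visit|, summed over pages."""
    return sum(rates.get(page, 0.0) * abs(t_access - visit)
               for page, (visit, _) in capture.items())


capture = {"a.html": (0, 10), "b.html": (2, 10), "c.html": (4, 10)}
changes = {"b.html": [5], "c.html": [1, 12]}
print(coherence(capture, changes))         # -> 2 (a.html and c.html)
print(blur(capture, {"b.html": 0.2}, 12))  # -> 2.0
```

In this toy capture, only `b.html` changed between its visit and revisit, so two of the three pages are coherent; a scheduling strategy would order visits and revisits so that as few pages as possible fall into that changed state.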