SHARC: framework for quality-conscious web archiving
Proceedings of the VLDB Endowment
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts working on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but politeness etiquette and completeness requirements mandate very slow, long-duration crawling while the Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit-revisit crawls. Single-visit crawls download every page of a site exactly once, in an order that aims to minimize the "blur" in capturing the site. Visit-revisit strategies revisit pages after their initial downloads to check for intermediate changes; the revisiting order aims to maximize the "coherence" of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, "sharp" site. Coherence is a deterministic quality measure that counts the number of unchanged, and thus coherently captured, pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of, or predictions for, the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions.
All strategies are fully implemented in a testbed and shown to be effective through experiments with both synthetically generated sites and a series of periodic crawls of different Web sites.
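To make the two quality notions concrete, the sketch below illustrates them on toy data. It is not the paper's implementation: the data layout, function names, and the simplified Poisson-based blur score (expected changes between a page's download time and an access time) are all assumptions for this example; coherence follows the abstract's definition directly.

```python
# Illustrative sketch of the quality measures described in the abstract.
# Not the paper's implementation; data layout and names are assumed here.

def coherence(capture, changes):
    """Deterministic coherence: the number of pages with no change between
    their initial download (visit) and their revisit.

    capture: {page: (visit_time, revisit_time)}
    changes: {page: list of change timestamps}
    """
    coherent = 0
    for page, (visit, revisit) in capture.items():
        # a page is coherently captured if no change falls in (visit, revisit]
        if not any(visit < t <= revisit for t in changes.get(page, [])):
            coherent += 1
    return coherent


def blur(capture, rates, t_access):
    """Blur-like score (a simplification of the paper's stochastic measure):
    under a Poisson change model with per-page rate lambda_p, the expected
    number of changes between a page's download and a time-travel access
    at t_access is lambda_p * |t_access - visit|, summed over pages."""
    return sum(rates.get(page, 0.0) * abs(t_access - visit)
               for page, (visit, _) in capture.items())


capture = {"a.html": (0, 10), "b.html": (2, 10), "c.html": (4, 10)}
changes = {"b.html": [5], "c.html": [1, 12]}
print(coherence(capture, changes))         # -> 2 (a.html and c.html)
print(blur(capture, {"b.html": 0.2}, 12))  # -> 2.0
```

In this toy capture, only `b.html` changed between its visit and revisit, so two of the three pages are coherent; a scheduling strategy would order visits and revisits so that as few pages as possible fall into that changed state.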