Web archives preserve the history of Web sites and have high long-term value for media and business analysts. Such archives are maintained by periodically re-crawling entire Web sites of interest. From an archivist's point of view, the ideal way to ensure the highest possible data quality would be to "freeze" the complete contents of a Web site for the duration of the crawl that captures it. This is, of course, practically infeasible. To comply with a Web site's politeness specification, the crawler must pause between successive HTTP requests to avoid placing undue load on the site's HTTP server. As a consequence, capturing a large Web site may span hours or even days, which increases the risk that content collected early in the crawl is incoherent with the parts still to be crawled. This paper introduces a model for identifying coherent sections of an archive and thus for measuring data quality in Web archiving. In addition, we present a crawling strategy that aims to ensure archive coherence by minimizing the diffusion of Web site captures. Preliminary experiments demonstrate the usefulness of the model and the effectiveness of the strategy.
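To make the coherence problem concrete, the following is a minimal, hypothetical sketch (not the paper's actual model): a crawler visits each page of a site with a politeness delay, records a content hash, then re-fetches each page and compares hashes. Pages unchanged across the crawl window can be regarded as coherent with the snapshot; pages that changed mid-crawl are incoherent. The `fetch` callable and the delay value are assumptions for illustration only.

```python
import hashlib
import time


def content_hash(content: bytes) -> str:
    # Hash page bytes so change detection does not require storing full content.
    return hashlib.sha256(content).hexdigest()


def crawl_with_coherence_check(urls, fetch, politeness_delay=1.0):
    """Sketch of a coherence check: fetch(url) -> bytes is a caller-supplied
    helper (an assumption, not an API from the paper).

    Returns (coherent, incoherent) lists of URLs.
    """
    snapshot = {}
    for url in urls:
        snapshot[url] = content_hash(fetch(url))
        time.sleep(politeness_delay)  # respect the site's politeness delay

    coherent, incoherent = [], []
    for url in urls:
        # Second pass: did the page change while the rest of the site
        # was being crawled? If so, the capture is internally inconsistent.
        time.sleep(politeness_delay)
        if content_hash(fetch(url)) == snapshot[url]:
            coherent.append(url)
        else:
            incoherent.append(url)
    return coherent, incoherent
```

The longer the crawl (more URLs, longer delays), the larger the window in which pages can drift, which is exactly why minimizing the temporal spread of a site capture improves coherence.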