Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding both the CMS the site is based on (leading to suboptimal crawling strategies) and any structured content contained in the pages (resulting in page-level archives whose content is hard to exploit). In this paper we present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about the extracted content. To deal with possible changes in the structure of Web applications, the AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structure change.
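
The idea of recognizing the Web application and then refining the URL list can be illustrated with a minimal Python sketch. This is not the AAH implementation described in the paper: the `AppProfile` class, the `detect_application` and `refine_urls` functions, and all regular expressions below are hypothetical and chosen only for illustration. The sketch assumes that an application can be recognized from a marker in the page HTML and that each application type has a pattern describing the URLs worth crawling.

```python
import re
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical description of one Web application type."""
    name: str
    detect: re.Pattern  # matched against page HTML to recognize the application
    keep: re.Pattern    # matched against URLs that are worth crawling


# Illustrative patterns only; they are assumptions, not taken from the paper.
PROFILES = [
    AppProfile(
        name="phpBB",
        detect=re.compile(r"Powered by\s*<a[^>]*>phpBB", re.I),
        keep=re.compile(r"/view(forum|topic)\.php\?"),
    ),
    AppProfile(
        name="WordPress",
        detect=re.compile(r'<meta name="generator" content="WordPress', re.I),
        keep=re.compile(r"/\d{4}/\d{2}/[^/?]+/?$"),
    ),
]


def detect_application(html):
    """Return the first profile whose marker appears in the HTML, else None."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def refine_urls(html, urls):
    """Keep only URLs relevant to the detected application.

    If no application is recognized, fall back to the full URL list,
    which mimics the behavior of an application-unaware crawler.
    """
    profile = detect_application(html)
    if profile is None:
        return list(urls)
    return [u for u in urls if profile.keep.search(u)]
```

For a page carrying a phpBB footer, `refine_urls` would keep forum and topic listings and drop pages such as member profiles. An adaptation step, as mentioned in the abstract, would additionally have to update such patterns when the site structure changes; that part is not covered by this sketch.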