Intelligent and adaptive crawling of web applications for web archiving

Authors:
Muhammad Faheem; Pierre Senellart
Affiliations:
Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, Paris, France; Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, Paris, France, and The University of Hong Kong, Hong Kong
Venue:
ICWE'13 Proceedings of the 13th international conference on Web Engineering
Year:
2013

Abstract

Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding both the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content is contained in Web pages (resulting in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. To deal with possible changes in the structure of Web applications, our AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structural change.
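
As a rough illustration of the application-aware idea described in the abstract, the following Python sketch detects which Web application produced a page and then refines the list of candidate URLs accordingly. It is only a minimal sketch under stated assumptions: the `AppProfile` class, the `detect_application` and `refine_urls` functions, and every regular expression are hypothetical and invented for this example; they are not the AAH's actual knowledge base or interface, and the sketch does not cover content extraction or the adaptation module.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of one Web application type."""
    name: str
    detect: re.Pattern  # searched in the page HTML to recognise the application
    keep: re.Pattern    # URLs assumed to carry archivable content
    skip: re.Pattern    # URLs assumed to be redundant or irrelevant


# Illustrative profiles only; all patterns are assumptions made for this example.
PROFILES = [
    AppProfile(
        name="phpBB",
        detect=re.compile(r"Powered by\s*(<[^>]+>\s*)?phpBB", re.I),
        keep=re.compile(r"/(viewforum|viewtopic)\.php\?"),
        skip=re.compile(r"/(posting|search|memberlist|ucp)\.php"),
    ),
    AppProfile(
        name="WordPress",
        detect=re.compile(r'<meta name="generator" content="WordPress', re.I),
        keep=re.compile(r"/\d{4}/\d{2}/[^/?]+/?$"),
        skip=re.compile(r"[?&](replytocom|share)=|/wp-(login|admin)"),
    ),
]


def detect_application(html: str) -> Optional[AppProfile]:
    """Return the first profile whose detection pattern occurs in the page."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def refine_urls(html: str, candidate_urls: list[str]) -> list[str]:
    """Keep only the URLs that the detected application's profile considers useful."""
    profile = detect_application(html)
    if profile is None:
        # Unknown application: behave like a regular crawler and keep everything.
        return list(candidate_urls)
    return [
        url for url in candidate_urls
        if profile.keep.search(url) and not profile.skip.search(url)
    ]
```

With a page recognised as a phpBB forum, for instance, `refine_urls` would keep topic and forum listing URLs and drop posting forms and member lists, while a page from an unrecognised application leaves the candidate list unchanged.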