Intelligent and adaptive crawling of web applications for web archiving

  • Authors:
  • Muhammad Faheem;Pierre Senellart

  • Affiliations:
  • Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, Paris, France;Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, Paris, France, The University of Hong Kong, Hong Kong

  • Venue:
  • ICWE'13 Proceedings of the 13th international conference on Web Engineering
  • Year:
  • 2013

Abstract

Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding both the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content is contained in Web pages (resulting in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. To deal with possible changes in the structure of Web applications, our AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structure change.
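
As a rough illustration of the idea of application-aware URL refinement described in the abstract, the following is a minimal sketch and not the authors' implementation. The profile structure, the regular expressions, and the names `AppProfile`, `detect_application`, and `refine_urls` are all assumptions made for this example; the actual AAH is described in the paper itself.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of one Web application type."""
    name: str
    detect: re.Pattern  # hint in the HTML suggesting this application
    keep: re.Pattern    # URL shapes assumed to carry archivable content
    skip: re.Pattern    # URL shapes assumed to be redundant or useless


# Illustrative patterns only; real detection rules would be richer.
PROFILES = [
    AppProfile(
        name="wordpress",
        detect=re.compile(r"wp-content|wp-includes"),
        keep=re.compile(r"/\d{4}/\d{2}/|[?&]p=\d+"),
        skip=re.compile(r"replytocom=|/wp-login\.php|/feed/?$"),
    ),
    AppProfile(
        name="phpbb",
        detect=re.compile(r"phpBB", re.IGNORECASE),
        keep=re.compile(r"view(topic|forum)\.php"),
        skip=re.compile(r"(posting|ucp|memberlist)\.php"),
    ),
]


def detect_application(html):
    """Return the first profile whose hint appears in the page, else None."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def refine_urls(html, candidate_urls):
    """Filter outgoing links according to the detected application.

    When no application is recognised, fall back to keeping every link,
    which is what an application-unaware crawler would do.
    """
    profile = detect_application(html)
    if profile is None:
        return list(candidate_urls)
    return [
        url for url in candidate_urls
        if profile.keep.search(url) and not profile.skip.search(url)
    ]
```

In this sketch, a page that looks like WordPress keeps dated post URLs and drops login or reply links, while a page from an unrecognised site passes all of its links through unchanged. Adaptation to structure changes, which the abstract also mentions, is not modelled here.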