Intelligent crawling of web applications for web archiving

Authors: Muhammad Faheem
Affiliations: Télécom ParisTech, Paris, France
Venue: Proceedings of the 21st international conference companion on World Wide Web
Year: 2012

Abstract

The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools currently used by Web archivists blindly crawl and store the Web pages they encounter, disregarding both the kind of Web site being accessed (which leads to suboptimal crawling strategies) and any structured content contained in the pages (which results in page-level archives whose content is hard to exploit). This PhD work focuses on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web in a form accessible to Web browsers. Examples include Web forums, social networks, and geolocation services. We claim that the best strategy for crawling these applications is to make the Web crawler aware of the kind of application being processed, allowing it to refine the list of URLs to process and to annotate the archive with information about the structure of the crawled content. We add adaptive characteristics to an archival Web crawler: the ability to identify when a Web page belongs to a given Web application, and to apply the appropriate crawling and content extraction methodology.
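
The application-aware idea described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the `AppProfile` structure, the `FORUM` profile, its detection patterns (`viewtopic`, `viewforum`, `class="post"`), and the `plan` function are all hypothetical names chosen for illustration. The sketch only shows the general shape of the approach: detect which kind of Web application a page belongs to, then use that knowledge to refine the list of URLs to process and to label the result.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Naive link extraction, good enough for a sketch (a real crawler would parse the HTML).
LINK_RE = re.compile(r'href="([^"]+)"')


@dataclass
class AppProfile:
    """Hypothetical description of one kind of Web application."""
    name: str
    detect: Callable[[str, str], bool]  # (url, html) -> does the page belong to this kind?
    keep_url: Callable[[str], bool]     # is this outgoing link worth crawling?


# Assumed patterns for a phpBB-style forum; purely illustrative.
FORUM = AppProfile(
    name="web-forum",
    detect=lambda url, html: 'class="post"' in html or "viewtopic" in url,
    keep_url=lambda u: "viewtopic" in u or "viewforum" in u,
)

PROFILES = [FORUM]


def identify(url: str, html: str) -> Optional[AppProfile]:
    """Return the first profile whose detector matches the page, or None."""
    for profile in PROFILES:
        if profile.detect(url, html):
            return profile
    return None


def plan(url: str, html: str) -> dict:
    """Decide which links to follow and annotate the page with its application kind."""
    links = LINK_RE.findall(html)
    profile = identify(url, html)
    if profile is None:
        # Unknown application: fall back to the generic "follow everything" behaviour.
        return {"application": None, "urls": links}
    return {
        "application": profile.name,
        "urls": [u for u in links if profile.keep_url(u)],
    }
```

For a forum page, `plan` keeps only thread and board links and drops links such as login pages. For an unrecognised page it behaves like a conventional crawler. Keeping detection and URL filtering together in one profile object is a design choice of this sketch, so that supporting a new kind of application would only require adding a profile.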