A training algorithm for optimal margin classifiers
COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Rule-Based Query Language for HTML
DASFAA '01 Proceedings of the 7th International Conference on Database Systems for Advanced Applications
The connectivity sonar: detecting site functionality by structural patterns
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Path sharing and predicate evaluation for high-performance XML filtering
ACM Transactions on Database Systems (TODS)
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Coarse-grained classification of web sites by their structural properties
WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Web Archiving
An adaptive crawler for locating hidden-Web entry points
Proceedings of the 16th international conference on World Wide Web
Proceedings of the 16th international conference on World Wide Web
Board Forum Crawling: A Web Crawling Method for Web Forum
WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Joint optimization of wrapper generation and template detection
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Declarative information extraction using datalog with embedded extraction predicates
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
Wraplet: Wrapping Your Web Contents with a Lightweight Language
SITIS '07 Proceedings of the 2007 Third International IEEE Conference on Signal-Image Technologies and Internet-Based System
Intelligent and adaptive crawling of web applications for web archiving
ICWE'13 Proceedings of the 13th international conference on Web Engineering
User perception knowledge for socially-aware web document accessibility
UAHCI'13 Proceedings of the 7th international conference on Universal Access in Human-Computer Interaction: user and context diversity - Volume 2
Hi-index | 0.00 |
The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently accessed (which leads to suboptimal crawling strategies) and whatever structured content is contained in Web pages (which results in page-level archives whose content is hard to exploit). We focus in this PhD work on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind of application currently processed, allowing it to refine the list of URLs to process, and to annotate the archive with information about the structure of crawled content. We add adaptive characteristics to an archival Web crawler: being able to identify when a Web page belongs to a given Web application and applying the appropriate crawling and content extraction methodology.