Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides a better criterion to guide the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and to generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including those pages added after their generation.
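The idea of a navigation pattern can be illustrated with a small sketch. It is not the method of the paper: the link patterns here are written by hand rather than generated from the sample page, and structural similarity is approximated by the Jaccard overlap of root-to-node tag paths, a simple stand-in chosen for brevity. The in-memory `SITE`, the `NAV_PATTERNS` list, and the 0.8 threshold are all hypothetical values used only for illustration.

```python
import re
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag paths and the link targets of a page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # e.g. "html/body/table/tr"
        self.links = []     # href values in document order

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def parse(html):
    """Return (set of tag paths, list of links) for an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths, collector.links


def similarity(paths_a, paths_b):
    """Jaccard overlap of two tag-path sets (assumed similarity measure)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def crawl(site, entry, nav_patterns, sample_html, threshold=0.8):
    """Follow one link pattern per level, then keep pages similar to the sample."""
    sample_paths, _ = parse(sample_html)
    frontier = [entry]
    for pattern in nav_patterns:
        next_frontier = []
        for url in frontier:
            _, links = parse(site.get(url, ""))
            for link in links:
                if re.fullmatch(pattern, link) and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return [
        url for url in frontier
        if similarity(parse(site.get(url, ""))[0], sample_paths) >= threshold
    ]


# Hypothetical site: entry page -> category page -> item pages.
SITE = {
    "/": '<html><body><a href="/cat/1">Cat</a><a href="/about">About</a></body></html>',
    "/cat/1": '<html><body><ul><li><a href="/item/1">x</a></li>'
              '<li><a href="/item/2">y</a></li>'
              '<li><a href="/item/help">help</a></li></ul></body></html>',
    "/item/1": '<html><body><h1>T</h1><table><tr><td>p</td></tr></table></body></html>',
    "/item/2": '<html><body><h1>U</h1><table><tr><td>q</td></tr>'
               '<tr><td>r</td></tr></table></body></html>',
    "/item/help": '<html><body><p>Help text</p></body></html>',
}
NAV_PATTERNS = [r"/cat/\d+", r"/item/\w+"]  # hand-written for this sketch

if __name__ == "__main__":
    print(crawl(SITE, "/", NAV_PATTERNS, SITE["/item/1"]))
```

In this sketch the link patterns narrow the crawl to candidate URLs, and the structural check then discards `/item/help`, which matches the URL pattern but does not share the sample's layout. A page added to the site later is collected without changing the crawler, provided it is linked through the same patterns and has the same structure.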