Structure-driven crawler generation by example

Authors:
Márcio L. A. Vidal; Altigran S. da Silva; Edleno S. de Moura; João M. B. Cavalcanti
Affiliations:
Universidade Federal do Amazonas, Manaus, Brazil; Universidade Federal do Amazonas, Manaus, Brazil; Universidade Federal do Amazonas, Manaus, Brazil; Universidade Federal do Amazonas, Manaus, Brazil
Venue:
SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Year:
2006


Abstract

Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides better criteria to guide the crawling process than their content does. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site, and to generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages that are structurally similar to the sample page. In our experiments, structure-driven crawlers generated by this approach collected all pages matching the given samples, including pages added to the sites after the crawlers were generated.
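To make the idea of a navigation pattern concrete, the following is a minimal Python sketch of how a structure-driven crawler might operate once such a pattern exists. It is an illustration under stated assumptions, not the authors' implementation: the navigation pattern is supplied by hand as a list of regular expressions (the abstract describes generating it from the sample page and entry point, which is not reproduced here), structural similarity is approximated by Jaccard similarity over sets of tag paths, and the names `crawl`, `similarity`, `parse`, `PageParser`, the `SITE` dictionary, and the 0.8 threshold are all hypothetical.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PageParser(HTMLParser):
    """Collects root-to-node tag paths and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def parse(html):
    """Return (set of tag paths, list of link targets) for an HTML string."""
    parser = PageParser()
    parser.feed(html)
    return parser.paths, parser.links


def similarity(paths_a, paths_b):
    """Jaccard similarity of two tag-path sets (an assumed, simplified measure)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def crawl(fetch, entry_url, navigation_pattern, sample_html, threshold=0.8):
    """Follow one link pattern per level, then keep pages shaped like the sample."""
    sample_paths, _ = parse(sample_html)
    frontier = [entry_url]
    for link_pattern in navigation_pattern:
        regex = re.compile(link_pattern)
        next_frontier = []
        for url in frontier:
            _, links = parse(fetch(url))
            for link in links:
                if regex.fullmatch(link) and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return [url for url in frontier
            if similarity(parse(fetch(url))[0], sample_paths) >= threshold]


# A toy in-memory site standing in for real HTTP fetches (hypothetical data).
ITEM = "<html><body><h1>{}</h1><table><tr><td>price</td></tr></table></body></html>"
SITE = {
    "/": '<html><body><a href="/cat/1">Cat</a><a href="/about">About</a></body></html>',
    "/cat/1": '<html><body><ul><li><a href="/item/1">A</a></li>'
              '<li><a href="/item/2">B</a></li><li><a href="/item/3">C</a></li>'
              '<li><a href="/help">Help</a></li></ul></body></html>',
    "/item/1": ITEM.format("A"),
    "/item/2": ITEM.format("B"),
    "/item/3": "<html><body><p>Not found</p></body></html>",
    "/about": "<html><body><p>About us</p></body></html>",
    "/help": "<html><body><p>Help</p></body></html>",
}

pattern = [r"/cat/\d+", r"/item/\d+"]
print(crawl(SITE.get, "/", pattern, SITE["/item/1"]))
# ['/item/1', '/item/2']
```

In this sketch, `/item/3` matches the link pattern but is rejected because its structure differs from the sample, while `/about` and `/help` are never fetched at all. Because the crawler stores patterns rather than concrete URLs, a page added later under `/item/` and linked from the category page would also be collected, which is consistent with the behaviour the abstract reports for newly added pages.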