A conceptual framework for efficient web crawling in virtual integration contexts

Authors:
Inma Hernández;Hassan A. Sleiman;David Ruiz;Rafael Corchuelo
Affiliations:
University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain
Venue:
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Year:
2011

Abstract

Virtual integration systems require a crawler that can navigate the Web and reach relevant pages efficiently. Existing crawling proposals are aware of the efficiency problem, but most of them still need to download a page in order to classify it as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. At each step, such a crawler follows only the URLs that lead to relevant pages, which reduces the number of unnecessary page downloads, optimises bandwidth usage, and makes the crawler efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without prior intervention from the user.
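
The idea of deciding relevance from the URL alone, before any download, can be illustrated with a minimal Python sketch. It is not the framework or classifier described in the paper. The names `url_signature`, `UrlClassifier`, `crawl`, and `get_links` are hypothetical, the relevant URL patterns are supplied by hand rather than learned, and the link structure is an in-memory dictionary instead of a real site.

```python
import re
from collections import deque
from urllib.parse import urlparse


def url_signature(url):
    """Reduce a URL to a coarse pattern (assumed feature, not the paper's):
    path segments with digit runs masked as '#', plus sorted query keys."""
    parts = urlparse(url)
    segments = [re.sub(r"\d+", "#", s) for s in parts.path.split("/") if s]
    keys = sorted(p.split("=")[0] for p in parts.query.split("&") if p)
    signature = "/" + "/".join(segments)
    if keys:
        signature += "?" + "&".join(keys)
    return signature


class UrlClassifier:
    """Hypothetical stand-in for a URL-based page classifier.
    Here the relevant signatures are given by hand; a real classifier
    would have to learn them."""

    def __init__(self, relevant_signatures):
        self.relevant = set(relevant_signatures)

    def is_relevant(self, url):
        return url_signature(url) in self.relevant


def crawl(seed, get_links, classifier, max_pages=100):
    """Breadth-first crawl that only enqueues links classified as relevant.
    get_links(url) is an assumed callback returning the links on a page."""
    queue = deque([seed])
    seen = {seed}
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        downloaded.append(url)  # the only point where a page is "fetched"
        for link in get_links(url):
            if link not in seen and classifier.is_relevant(link):
                seen.add(link)
                queue.append(link)
    return downloaded


# Toy link graph standing in for a web site (illustrative data only).
SITE = {
    "http://example.org/": [
        "http://example.org/book/12",
        "http://example.org/about",
        "http://example.org/search?q=x&page=2",
    ],
    "http://example.org/book/12": [
        "http://example.org/book/13",
        "http://example.org/login",
    ],
}

classifier = UrlClassifier({"/book/#"})
pages = crawl("http://example.org/", lambda u: SITE.get(u, []), classifier)
```

In this toy run the crawler fetches only the seed page and the two `/book/...` pages. The `about`, `search`, and `login` links are discarded from their URLs alone, so they never cost a download.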