Accelerating Structured Web Crawling without Losing Data

Authors:
Boutros R. El-Gamil;Werner Winiwarter
Affiliations:
Vienna University of Technology, Institute for Software Technology and Interactive Systems, Favoritenstraße 9-11, 1040 Vienna, Austria;University of Vienna, Research Group Data Analytics and Computing, Währinger Straße 29, 1090 Vienna, Austria
Venue:
Proceedings of International Conference on Information Integration and Web-based Applications & Services
Year:
2013

Abstract

The trade-off between the amount of data retrieved and the time spent crawling is a well-known dilemma in the structured Web crawling community. The real challenge within this dilemma is to tune the settings of a given wrapper so that it obtains the maximum available data in the shortest possible time. In this paper, we try to tune these settings in two steps. First, we introduce a threaded algorithm that guarantees access to all available detail pages within the crawling scope. Second, using this algorithm, we try to reduce the time consumed by the crawler through simple adjustments of the sleeping time after each detail page visit.
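The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a pool of worker threads that visits every detail page URL, pauses for a configurable time after each visit, and records failures instead of dropping them. The names `crawl_detail_pages`, `fetch`, `sleep_seconds`, and `workers` are hypothetical and chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def crawl_detail_pages(urls, fetch, sleep_seconds=0.0, workers=4):
    """Visit every URL in `urls` using a pool of threads.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the page content. Each worker sleeps for
    `sleep_seconds` after every visit, whether it succeeded or failed.
    Returns (pages, failed): content by URL, and exception by URL.
    """

    def visit(url):
        try:
            return url, fetch(url), None
        except Exception as exc:
            # Keep the failure so that no detail page is silently lost.
            return url, None, exc
        finally:
            # The tunable pause after each detail page visit.
            time.sleep(sleep_seconds)

    pages, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields one result per input URL, in input order.
        for url, content, error in pool.map(visit, urls):
            if error is None:
                pages[url] = content
            else:
                failed[url] = error
    return pages, failed
```

In a sketch like this, every URL ends up in exactly one of the two returned dictionaries, so completeness can be checked by comparing their combined size with the number of input URLs. Lowering `sleep_seconds` shortens the crawl, at the possible cost of more failed visits that would need to be retried.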