Accelerating Structured Web Crawling without Losing Data

  • Authors: Boutros R. El-Gamil; Werner Winiwarter

  • Affiliations: Vienna University of Technology, Institute for Software Technology and Interactive Systems, Favoritenstraße 9-11, 1040 Vienna, Austria; University of Vienna, Research Group Data Analytics and Computing, Währinger Straße 29, 1090 Vienna, Austria

  • Venue: Proceedings of International Conference on Information Integration and Web-based Applications & Services
  • Year: 2013

Abstract

The trade-off between the size of retrieved data and crawling time is a well-known dilemma in the structured Web crawling community. The real challenge is to tune the settings of a given wrapper so that it obtains the maximum available data in the shortest possible time. In this paper, we try to tune these settings in two steps. First, we introduce a threaded algorithm that guarantees access to all available detail pages within the crawling scope. Second, using this algorithm, we try to reduce the time consumed by the crawler through simple adjustments of the sleeping time after each detail page visit.
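
The idea of a threaded crawler with a tunable sleeping time can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a hypothetical `fetch` callable supplied by the caller, and the names `crawl_detail_pages`, `sleep_seconds`, `workers`, and `max_rounds` are illustrative. Failed pages are retried for a bounded number of rounds, and any page still missing is reported rather than dropped silently.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_detail_pages(urls, fetch, sleep_seconds=0.0, workers=4, max_rounds=3):
    """Visit every detail page; return (results, still_missing).

    `fetch` is a hypothetical caller-supplied function url -> page content.
    """
    results = {}
    pending = list(dict.fromkeys(urls))  # de-duplicate while keeping order

    def visit(url):
        try:
            return fetch(url)
        finally:
            # Sleeping time after each detail page visit (the tunable setting).
            time.sleep(sleep_seconds)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            futures = {pool.submit(visit, url): url for url in pending}
            failed = []
            for future in as_completed(futures):
                url = futures[future]
                try:
                    results[url] = future.result()
                except Exception:
                    failed.append(url)  # retry in the next round
            pending = failed

    return results, pending
```

Under these assumptions, lowering `sleep_seconds` shortens the crawl, while the returned `still_missing` list shows whether any detail page was lost as a result.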