Application of structured document parsing to focused web crawling

  • Authors: Ahmed Patel; Nikita Schmidt

  • Venue: Computer Standards & Interfaces
  • Year: 2011

Abstract

The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
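
The abstract does not specify the parser/scheduler interface itself, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes that anchor text is the HTML element used for prioritisation and that relevance is a simple count of topic terms; the names `LinkExtractor`, `Scheduler`, `score_text`, and `feed` are hypothetical and chosen for illustration. The parser reports (URL, anchor text) pairs, and the scheduler sees only URLs and scores, which keeps the two sides separable.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Parser side (hypothetical): reports each link with its anchor text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (absolute_url, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


class Scheduler:
    """Scheduler side (hypothetical): a priority queue unaware of HTML."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores keep insertion order

    def submit(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, self._counter, url))
            self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def score_text(text, topic_terms):
    """Toy relevance score (an assumption): topic terms present in the text."""
    return len(set(text.lower().split()) & topic_terms)


def feed(html, base_url, topic_terms, scheduler):
    """Parse one downloaded document and hand scored links to the scheduler."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    for url, anchor in parser.links:
        scheduler.submit(url, score_text(anchor, topic_terms))
```

In use, a crawl loop would repeatedly call `next_url()`, download the document, and pass it to `feed()`. Because the scheduler accepts only a URL and a score, the scoring rule or the set of HTML elements inspected could be changed without touching the queueing logic.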