Tunneling enhanced by web page content block partition for focused crawling: Research Articles

  • Authors:
  • Tao Peng;Changli Zhang;Wanli Zuo

  • Affiliations:
  • College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China (all three authors)

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2008

Abstract

The complexity of web information environments and the prevalence of multiple-topic web pages significantly degrade the performance of focused crawling. A highly relevant region in a web page may be obscured by the low overall relevance of that page. Segmenting web pages into smaller units can significantly improve crawling performance. Traversing irrelevant pages to reach a relevant one (tunneling) can improve the effectiveness of focused crawling by expanding its reach. This paper presents a heuristic-based method to enhance focused crawling performance. The method uses a Document Object Model (DOM)-based page partition algorithm to segment a web page into content blocks with a hierarchical structure, and investigates how to take advantage of block-level evidence to enhance focused crawling by tunneling. Page segmentation can transform an uninteresting multi-topic web page into several single-topic content blocks, some of which may be interesting. Accordingly, a focused crawler can pursue the interesting content blocks to retrieve the relevant pages. Experimental results indicate that this approach outperforms the Breadth-First, Best-First and Link-context algorithms in harvest rate, target recall and target length. Copyright © 2007 John Wiley & Sons, Ltd.
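
The block-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the set of block-level tags, the keyword-overlap relevance measure, and the names BlockPartitioner, relevance and score_links are all assumptions made for illustration, and the sketch produces a flat list of blocks rather than the hierarchical structure the paper describes. It only shows how a link inside a relevant block can receive a high score even when the page as a whole scores low.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as content-block boundaries (not from the paper).
BLOCK_TAGS = {"div", "td", "p", "li", "section", "article"}


class BlockPartitioner(HTMLParser):
    """Collect the text and links of each block-level element (flat, innermost block wins)."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: {"text": str, "links": [str]}
        self._stack = []   # currently open blocks, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text": [], "links": []})
        elif tag == "a" and self._stack:
            href = dict(attrs).get("href")
            if href:
                self._stack[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            block = self._stack.pop()
            text = " ".join(" ".join(block["text"]).split())
            if text or block["links"]:
                self.blocks.append({"text": text, "links": block["links"]})

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"].append(data)


def relevance(text, topic_terms):
    """Hypothetical relevance measure: fraction of words that are topic terms."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def score_links(html, topic_terms):
    """Give each link the relevance of the block that contains it."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    scores = {}
    for block in parser.blocks:
        block_score = relevance(block["text"], topic_terms)
        for href in block["links"]:
            scores[href] = max(scores.get(href, 0.0), block_score)
    return scores


PAGE = """
<div>Stock markets fell today as investors sold shares. <a href="/finance">More finance</a></div>
<div>A focused crawler ranks links by topic relevance. <a href="/crawler">Crawler details</a></div>
"""
TOPIC = {"focused", "crawler", "topic", "relevance", "links"}

scores = score_links(PAGE, TOPIC)
for href, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(href, round(score, 2))
```

In this toy page the on-topic block is diluted by the finance block when the page is scored as a whole, yet the link inside the on-topic block still ranks first. A crawler using such scores as frontier priorities would follow that link before the other one.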