A novel focused crawler based on breadcrumb navigation

  • Authors:
  • Lizhi Ying;Xinhao Zhou;Jian Yuan;Yongfeng Huang

  • Affiliations:
  • Institute of Information Cognition and Intelligence System, Department of Electric Engineering, Tsinghua University, Beijing, China (all authors)

  • Venue:
  • ICSI'12 Proceedings of the Third international conference on Advances in Swarm Intelligence - Volume Part II
  • Year:
  • 2012

Abstract

This paper proposes a novel focused crawler based on Breadcrumb Navigation (BN). The crawler uses the breadcrumb navigation trails found in webpages to reconstruct website structures and thereby address the focused crawling problem. Unlike previous focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for the websites based on Breadcrumb Navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages belonging to the relevant sub-forest. With this method, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and is applicable to most websites.
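
The sample-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names, the keyword-based `is_relevant` test, and the `example.com` URLs are all assumptions made for illustration, and the abstract does not say how breadcrumb trails are extracted or how topic relevance is scored. The sketch assumes each sampled page has already been reduced to a `(site, breadcrumb_trail, url)` tuple.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label, e.g. 'Sports' in 'Home > Sports > Tennis'."""
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    urls: list = field(default_factory=list)      # sampled pages at this node


def build_forest(samples):
    """Build one tree per site from (site, breadcrumb_trail, url) samples."""
    forest = {}
    for site, trail, url in samples:
        node = forest.setdefault(site, Node(site))
        for label in trail:
            node = node.children.setdefault(label, Node(label))
        node.urls.append(url)
    return forest


def is_relevant(label, topic_terms):
    """Hypothetical relevance test: a topic term occurs in the label."""
    text = label.lower()
    return any(term in text for term in topic_terms)


def relevant_subtrees(node, topic_terms):
    """Return the topmost nodes whose label matches the topic."""
    if is_relevant(node.label, topic_terms):
        return [node]
    found = []
    for child in node.children.values():
        found.extend(relevant_subtrees(child, topic_terms))
    return found


def urls_under(node):
    """Collect every sampled URL in the subtree rooted at node."""
    urls = list(node.urls)
    for child in node.children.values():
        urls.extend(urls_under(child))
    return urls


def crawl_targets(forest, topic_terms):
    """List sampled URLs that fall inside topic-relevant subtrees."""
    targets = []
    for root in forest.values():
        # The root only carries the site name, so matching starts below it.
        for child in root.children.values():
            for sub in relevant_subtrees(child, topic_terms):
                targets.extend(urls_under(sub))
    return targets


# Illustrative data only; the sites and trails are made up.
samples = [
    ("example.com", ["Home", "Sports", "Tennis"], "http://example.com/a"),
    ("example.com", ["Home", "Sports", "Golf"], "http://example.com/b"),
    ("example.com", ["Home", "Finance"], "http://example.com/c"),
]
forest = build_forest(samples)
print(crawl_targets(forest, ["sports"]))
```

In this sketch the relevance decision is made once per breadcrumb node rather than once per page, which is one plausible reading of why the BN crawler spends less time on W2T similarity. A real crawler would go on to download all pages under the selected sub-trees, not only the sampled ones returned here.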