Focused crawling guided by link context

  • Authors:
  • Jingru Dong;Wanli Zuo;Tao Peng

  • Affiliations:
  • College of Computer Science and Technology, Jilin University, Changchun, P.R. China;College of Computer Science and Technology, Jilin University, Changchun, P.R. China;College of Computer Science and Technology, Jilin University, Changchun, P.R. China

  • Venue:
  • AIA'06 Proceedings of the 24th IASTED international conference on Artificial intelligence and applications
  • Year:
  • 2006

Abstract

In designing a focused crawler, the choice of strategy for prioritizing unvisited URLs is crucial. Intuitively, the text surrounding a link on an HTML page (the link context) is a good summary of the target page. However, little work has been done to exploit link context information about the seed URLs before the actual crawl begins. Motivated by these two observations, we propose a method that collects this kind of resource beforehand and then uses it to guide the actual crawling. Experiments show that the proposed approach is reasonable and is particularly effective for single-topic crawling, especially at the initial stage.
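
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how link contexts are scored, so the following assumes a simple bag-of-words profile built from link contexts gathered before crawling, and cosine similarity between that profile and the context of each newly discovered link. All names here (`build_profile`, `Frontier`, the example URLs) are hypothetical and are not taken from the paper.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profile(contexts):
    """Merge link contexts collected before crawling into one term profile.

    Assumption: a single merged term-frequency vector stands in for the
    pre-collected link context resource described in the abstract.
    """
    profile = Counter()
    for text in contexts:
        profile.update(tokenize(text))
    return profile


class Frontier:
    """Priority queue of unvisited URLs, highest-scoring link context first."""

    def __init__(self, profile):
        self.profile = profile
        self.heap = []
        self.seen = set()

    def push(self, url, link_context):
        """Score a URL by its link context and queue it, skipping duplicates."""
        if url in self.seen:
            return
        self.seen.add(url)
        score = cosine(Counter(tokenize(link_context)), self.profile)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self.heap, (-score, url))

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self.heap)
        return url, -neg_score


# Hypothetical link contexts gathered for the seed URLs before crawling.
profile = build_profile([
    "python web crawler tutorial",
    "focused crawler for topical search",
])
frontier = Frontier(profile)
frontier.push("http://example.com/a", "a guide to building a focused crawler")
frontier.push("http://example.com/b", "cheap flights and hotel deals")
```

In this sketch, the link whose context shares terms with the pre-collected profile is popped before the unrelated one, so the earliest fetches are steered toward the topic. That is one plausible reading of why such a method would help most at the initial stage of a crawl.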