An effective relevance prediction algorithm based on hierarchical taxonomy for focused crawling

  • Authors:
  • Zhumin Chen;Jun Ma;Xiaohui Han;Dongmei Zhang

  • Affiliations:
  • School of Computer Science & Technology, Shandong University, Jinan, China;School of Computer Science & Technology, Shandong University, Jinan, China;School of Computer Science & Technology, Shandong University, Jinan, China;School of Computer Science & Technology, Shandong University, Jinan, China

  • Venue:
  • AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
  • Year:
  • 2008

Abstract

A key issue in the design of focused crawlers is how to formally describe a user's topic of interest and how to effectively predict the relevance of unvisited pages to that topic. However, almost all previously known focused crawlers perform Relevance Prediction based on the Flat Information (RPFI) of the topic only, i.e. without regard to the context between keywords or topics. In this paper, we first introduce an algorithm that maps a topic described by a keyword set or by a natural-language document to a topic described in a hierarchical topic taxonomy. We then propose a novel approach that performs Relevance Prediction based on the Hierarchical Context Information (RPHCI) of the taxonomy. Experiments show that a focused crawler based on RPHCI achieves significantly higher efficiency than those based on RPFI.
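The abstract does not give the details of RPHCI, so the following is only a minimal sketch of the general contrast between flat and hierarchy-aware relevance scoring, not the authors' algorithm. The toy taxonomy, the per-node keyword sets, the `decay` parameter, and the function names are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy: each node maps to its parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "sports": None,
    "ball_games": "sports",
    "soccer": "ball_games",
}

# Hypothetical keyword sets attached to each taxonomy node.
KEYWORDS: Dict[str, Set[str]] = {
    "sports": {"sport", "athlete", "league"},
    "ball_games": {"ball", "team", "match"},
    "soccer": {"soccer", "goal", "striker"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split on whitespace into a set of words."""
    return set(text.lower().split())


def flat_score(topic: str, text: str) -> float:
    """Flat scoring: fraction of the topic's own keywords found in the text."""
    keywords = KEYWORDS[topic]
    return len(keywords & tokenize(text)) / len(keywords)


def path_to_root(topic: str) -> List[str]:
    """Return the topic followed by its ancestors, ending at the root."""
    path: List[str] = []
    node: Optional[str] = topic
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def hierarchical_score(topic: str, text: str, decay: float = 0.5) -> float:
    """Hierarchy-aware scoring (an assumed scheme, not the paper's formula).

    Keyword matches on ancestor nodes also count, with a weight that
    shrinks by `decay` for every level above the target topic.
    """
    words = tokenize(text)
    total = 0.0
    norm = 0.0
    for depth, node in enumerate(path_to_root(topic)):
        weight = decay ** depth
        keywords = KEYWORDS[node]
        total += weight * len(keywords & words) / len(keywords)
        norm += weight
    return total / norm


if __name__ == "__main__":
    anchor_text = "the team won the match in the league"
    print(flat_score("soccer", anchor_text))
    print(hierarchical_score("soccer", anchor_text))
```

In this sketch, link text that mentions none of the "soccer" keywords gets a flat score of zero, while the hierarchy-aware score stays positive because the text matches keywords of the ancestor topics. That is the kind of contextual signal a taxonomy can supply; the weighting actually used by RPHCI may differ.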