Focused Crawling with Heterogeneous Semantic Information

  • Authors:
  • Rui Huang;Fen Lin;Zhongzhi Shi

  • Venue:
  • WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
  • Year:
  • 2008

Abstract

Focused crawlers selectively retrieve Web documents that are relevant to a predefined set of topics. To predict which URLs and web pages are relevant, different topic models have been introduced to represent topic-specific knowledge. Yet it is difficult to support semantic interoperability among these models. Moreover, manually specified semantic information, such as semantic markup and social annotations, has not been used effectively to improve crawling. This paper proposes to boost focused crawling with four kinds of semantic models and semantic information: thesauruses, categories, ontologies, and folksonomies. A statistical semantic association model is proposed to integrate different semantic models, represent heterogeneous semantic information, and support semantic relevance computation. A focused crawling framework is developed that uses both keyword-based content and different kinds of additional information for relevance prediction and ranking. Experiments show that the proposed model and framework effectively integrate heterogeneous semantic information for focused crawling.
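
The abstract does not give the details of the statistical semantic association model, so the following is only a rough sketch of the general idea of relevance prediction and ranking in a focused crawler, not the authors' method. It assumes a hypothetical `associations` table (term to related terms with weights, such as might be derived from a thesaurus or folksonomy), uses plain cosine similarity as a stand-in relevance measure, and ranks URLs in a priority-queue frontier. All function and class names are illustrative.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(terms, associations):
    """Add related terms, weighted by a (hypothetical) association strength."""
    expanded = Counter(terms)
    for term, weight in terms.items():
        for related, strength in associations.get(term, {}).items():
            expanded[related] += weight * strength
    return expanded


def relevance(text, topic, associations):
    """Score text against a topic using keywords plus associated terms."""
    terms = Counter(text.lower().split())
    return cosine(expand(terms, associations), expand(topic, associations))


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, anchor text or page content that shares no keyword with the topic can still receive a nonzero score when the association table links its terms to topic terms, which is the kind of effect that additional semantic information is meant to provide.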