Schema driven and topic specific web crawling

  • Authors:
  • Qi Guo;Hang Guo;Zhiqiang Zhang;Jing Sun;Jianhua Feng

  • Affiliations:
  • Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China

  • Venue:
  • DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

We propose a new approach to discover and extract topic-specific hypertext resources from the WWW. The method, called schema driven and topical crawling, allows a user to define schema and extracting rules for a specific domain of interests. It supports automatically search and extract schema-relevant web pages from the web. Different from common approaches that surf solely on web pages, our approach supports crawler to surf on a virtual network composed by concept instances and relationships. To achieve such a goal, we design an architecture that integrates several techniques including web extractor, meta-search engine and query expansion, and provide a toolkit to support it.