Exploiting Interclass Rules for Focused Crawling

  • Authors: Ismail Sengor Altingovde; Ozgur Ulusoy

  • Affiliations: Bilkent University; Bilkent University

  • Venue: IEEE Intelligent Systems

  • Year: 2004

Abstract

A focused crawler is an agent that concentrates on a particular target topic and tries to visit and gather only relevant pages from the Web. A crucial issue for a focused crawler is the underlying heuristic for deciding the page to visit next. The authors propose a rule-based approach to improve a baseline focused crawler's harvest rate and coverage. The baseline focused crawler employs a canonical topic taxonomy to train a naïve-Bayesian classifier, which then helps score unseen URLs. The authors explore using simple rules derived from interclass (topic) linkage patterns to decide the crawler's next move. The rule-based approach also enhances the baseline crawler in supporting tunneling. In initial performance results, the rule-based crawler improved the harvest rate and coverage of the baseline crawler.
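The idea of scoring links with interclass rules can be illustrated with a minimal Python sketch. The abstract does not give the rule format or the scoring formula, so everything here is an assumption made for illustration: rules are taken to be conditional probabilities P(destination class | source class) estimated from link counts between classified pages, and tunneling is approximated by allowing a bounded number of hops through off-topic classes. The names `learn_rules` and `rule_score` and the class labels are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def learn_rules(class_links):
    """Estimate P(dest_class | source_class) from (source, dest) class pairs.

    Assumption: each pair is one hyperlink between two pages that a
    classifier has already labeled with topic classes.
    """
    counts = defaultdict(Counter)
    for src, dst in class_links:
        counts[src][dst] += 1
    rules = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        rules[src] = {dst: n / total for dst, n in dsts.items()}
    return rules


def rule_score(rules, page_class, target, depth=2):
    """Score the links of a page of class `page_class` for reaching `target`.

    With depth=1 only the direct rule page_class -> target is used.
    With depth>1 the best chain through intermediate classes is also
    considered, which is one simple way to model tunneling.
    """
    out = rules.get(page_class, {})
    direct = out.get(target, 0.0)
    if depth <= 1:
        return direct
    indirect = max(
        (p * rule_score(rules, mid, target, depth - 1)
         for mid, p in out.items() if mid != target),
        default=0.0,
    )
    return max(direct, indirect)


# Hypothetical training links between page classes.
links = [
    ("department", "course"),
    ("department", "course"),
    ("department", "homepage"),
    ("homepage", "department"),
]
rules = learn_rules(links)
one_hop = rule_score(rules, "homepage", "course", depth=1)
two_hop = rule_score(rules, "homepage", "course", depth=2)
```

In this toy data a "homepage" page never links directly to a "course" page, so a one-hop score gives its links no priority, while the two-hop score credits the path through "department". A crawler could use such a score as the priority of unseen URLs in its frontier; how the authors combine rules with the classifier output is not stated in the abstract.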