Learning to crawl: Comparing classification schemes

  • Authors:
  • Gautam Pant; Padmini Srinivasan

  • Affiliations:
  • The University of Utah, Salt Lake City, UT; The University of Iowa, Iowa City, IA

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2005

Abstract

Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
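
To make the crawling model in the abstract concrete, the following is a minimal sketch of a classifier-guided best-first search over a link graph. It is an illustration under stated assumptions, not the authors' implementation: the names best_first_crawl, toy_links, toy_score, TOY_WEB and TOY_RELEVANCE are hypothetical, the tiny in-memory graph stands in for the Web, and the fixed relevance table stands in for a trained classifier (Naive Bayes, SVM or neural network). The "parallel" aspect is approximated by popping a batch of top-ranked URLs per iteration, and each unvisited link is assumed to inherit the score of the page on which it was found.

```python
import heapq

# Hypothetical toy link graph standing in for the Web.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

# Hypothetical relevance scores standing in for classifier output.
TOY_RELEVANCE = {"seed": 0.5, "a": 0.9, "b": 0.1,
                 "a1": 0.8, "a2": 0.7, "b1": 0.2}


def toy_links(url):
    """Return the outlinks of a page in the toy graph."""
    return TOY_WEB.get(url, [])


def toy_score(url):
    """Stand-in for a classifier's relevance estimate of a fetched page."""
    return TOY_RELEVANCE.get(url, 0.0)


def best_first_crawl(seeds, get_links, score, max_pages=10, batch_size=1):
    """Visit pages in order of the priority assigned to them in the frontier."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        # Pop up to batch_size of the best URLs, as a parallel crawler might.
        batch = [heapq.heappop(frontier)[1]
                 for _ in range(min(batch_size, len(frontier)))]
        for url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)
            page_score = score(url)
            # Assumption: outlinks inherit the score of their parent page.
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-page_score, link))
    return visited


if __name__ == "__main__":
    print(best_first_crawl(["seed"], toy_links, toy_score, max_pages=4))
```

With a budget of four pages, this sketch spends its fetches on the links found on the high-scoring page "a" and never reaches "b", which is the biasing effect the abstract describes. Swapping in a different scoring function changes which part of the graph is explored.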