A focused crawling for the web resource discovery using a modified proximal support vector machines

  • Authors:
  • YoungSik Choi;KiJoo Kim;MunSu Kang

  • Affiliations:
  • Department of Computer Engineering, Hankuk Aviation University, Koyang-City, Korea;Department of Computer Engineering, Hankuk Aviation University, Koyang-City, Korea;Department of Computer Engineering, Hankuk Aviation University, Koyang-City, Korea

  • Venue:
  • ICCSA'05 Proceedings of the 2005 international conference on Computational Science and its Applications - Volume Part I
  • Year:
  • 2005

Quantified Score

Hi-index 0.01

Abstract

With the rapid growth of the World Wide Web, focused crawling has become increasingly important. The goal of focused crawling is to seek out and collect pages that are relevant to a predefined set of topics. Determining the relevance of a page to a specific topic has typically been treated as a classification problem. However, training such classifiers is often complicated by the difficulty of selecting negative samples. This difficulty arises because collecting a set of pages relevant to a specific topic is not, by nature, a classification process. In this paper, we propose a novel focused crawling method that uses only positive samples to represent a given topic as a hyperplane, which we obtain from a modified Proximal Support Vector Machine. The distance from a page to the hyperplane is used to prioritize the order in which pages are visited. We demonstrate the performance of the proposed method on the WebKB data set and on the Web. The promising results suggest that the proposed method is more effective for the focused crawling problem than traditional approaches.
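The idea of representing a topic as a hyperplane fitted to positive samples only, then ranking pages by their distance to it, can be illustrated with a minimal sketch. The abstract does not give the authors' modified formulation, so the code below swaps in an assumed stand-in: a regularized least-squares fit of a plane w.x = 1 to the positive vectors. The objective, the parameter `nu`, the two-dimensional feature vectors, the example URLs, the function names, and the choice to visit the nearest pages first are all assumptions made for illustration.

```python
import heapq
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_hyperplane(positives, nu=10.0):
    """Fit w so that w.x is close to 1 on every positive sample.

    Assumed objective (not taken from the paper):
        minimize (nu / 2) * sum_i (w.x_i - 1)^2 + (1 / 2) * ||w||^2
    whose normal equations are (I / nu + A^T A) w = A^T e.
    """
    d = len(positives[0])
    lhs = [[sum(x[i] * x[j] for x in positives) + (1.0 / nu if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    rhs = [sum(x[i] for x in positives) for i in range(d)]
    return solve(lhs, rhs)


def distance(x, w):
    """Euclidean distance from x to the hyperplane w.x = 1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0) / norm


def crawl_order(candidates, w):
    """Return URLs ordered by increasing distance to the hyperplane (assumed priority)."""
    frontier = [(distance(vec, w), url) for url, vec in candidates.items()]
    heapq.heapify(frontier)
    return [heapq.heappop(frontier)[1] for _ in range(len(frontier))]


# Hypothetical term-weight vectors for pages known to be on topic.
POSITIVES = [[1.0, 0.0], [0.8, 0.2], [1.0, 0.2]]
# Hypothetical frontier: URL -> feature vector of the page.
CANDIDATES = {
    "http://example.com/on-topic": [0.9, 0.1],
    "http://example.com/off-topic": [0.1, 0.9],
}
W = fit_hyperplane(POSITIVES)
ORDER = crawl_order(CANDIDATES, W)
```

In this toy setup the candidate whose vector resembles the positive samples lies closer to the fitted plane and is dequeued first. A real crawler would recompute priorities as new links are discovered and would use high-dimensional text features rather than two hand-picked weights.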