Learnable topic-specific web crawler

  • Authors:
  • A. Rungsawang; N. Angkawattanawit

  • Affiliations:
  • Department of Computer Engineering, Massive Information & Knowledge Engineering, Kasetsart University, 50 Pahol-Yothin Rd, Jatujak, Bangkok 10900, Thailand; Department of Computer Engineering, Massive Information & Knowledge Engineering, Kasetsart University, 50 Pahol-Yothin Rd, Jatujak, Bangkok 10900, Thailand

  • Venue:
  • Journal of Network and Computer Applications - Special issue on computational intelligence on the internet
  • Year:
  • 2005

Abstract

A topic-specific web crawler collects web pages relevant to topics of interest from the Internet. Much previous research has focused on web page crawling algorithms. The main purpose of those algorithms is to gather as many relevant web pages as possible, and most of them describe only the first crawl. However, previous work has not addressed some important questions, such as how the crawler performs during subsequent crawling attempts and whether it can learn from experience to crawl more relevant web pages incrementally. In this paper, we present an algorithm that covers both the first and the consecutive crawls. To make the next crawl more efficient, we derive information from previous crawling attempts to build three knowledge bases: starting URLs, topic keywords, and URL prediction. These knowledge bases form the experience of the learnable topic-specific web crawler and are used to produce better results in the next crawl. A preliminary evaluation shows that the proposed web crawler can learn from experience to better collect the web pages of interest during the early period of consecutive crawling attempts.
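
The abstract names three knowledge bases but does not say how they are built. The following Python sketch illustrates one plausible way to derive them from the log of a previous crawl. Everything in it is an assumption made for illustration and not the authors' method: the `CrawlRecord` fields, the relevance threshold, choosing starting URLs by counting how many relevant pages each parent page led to, choosing keywords by raw term frequency, and using the remembered score of a URL as its prediction.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRecord:
    """One page seen in a previous crawl (hypothetical log format)."""
    url: str
    parent: Optional[str]  # page on which this URL was discovered
    score: float           # relevance score from the previous crawl, assumed in [0, 1]
    text: str              # page text


def build_knowledge_bases(log, threshold=0.5, n_seeds=2, n_keywords=3):
    """Derive starting URLs, topic keywords, and URL predictions from a crawl log."""
    relevant = [r for r in log if r.score >= threshold]

    # Starting URLs: parent pages that led to the most relevant pages.
    parent_hits = Counter(r.parent for r in relevant if r.parent is not None)
    seeds = [url for url, _ in parent_hits.most_common(n_seeds)]

    # Topic keywords: most frequent terms on relevant pages.
    terms = Counter()
    for r in relevant:
        terms.update(re.findall(r"[a-z]+", r.text.lower()))
    keywords = [term for term, _ in terms.most_common(n_keywords)]

    # URL prediction: remember the score each URL received last time,
    # so the next crawl can prioritise URLs it has already seen.
    predicted = {r.url: r.score for r in log}

    return seeds, keywords, predicted
```

In a subsequent crawl, `seeds` would initialise the frontier, `keywords` would feed the relevance scorer, and `predicted` would order known URLs in the queue before they are fetched again.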