A tool for link-based web page classification

  • Authors:
  • Inma Hernández;Carlos R. Rivero;David Ruiz;Rafael Corchuelo

  • Affiliations:
  • University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain

  • Venue:
  • CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages downloaded. In this paper, we propose a classifier that helps crawlers to efficiently navigate through web sites. This classifier is able to determine if a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing used bandwidth, making it suitable for virtual integration systems.